Re: Investigating a hung job on the autobuilder - a HOWTO


Ross Burton
 

A little script I have locally which is useful for finding bitbake instances:

 

for PID in $(pgrep '^Cooker$'); do

                pstree -p -l $PID

                echo

done

 

Output from a build I just fired:

 

Cooker(2380890)──Cooker(2380899)

                └─Worker(2381485)──Worker(2402174)

                                  ─attr:patch(2401804)───sh(2402170)

                                  ─bash-completion(2401738)

                                  ─bzip2:patch(2401753)

                                  ─cracklib:patch(2401737)

                                  ─curl-native:unp(2401716)───sh(2401909)──tar(2401930)

                                  │                                        └─xz(2401929)

                                  ─cve-update-db-n(2390133)

                                  ─edk2-firmware:u(2401360)

                                  ─flex:patch(2401736)

                                  ─gcc-source-12.1(2401703)───sh(2401764)──tar(2401778)

                                  │                                        └─xz(2401777)

                                  ─glib-2.0-native(2401763)───sh(2402164)

                                  ─glibc:patch(2401705)───sh(2402159)───quilt(2402175)

                                  ─gnutls-native:u(2401717)───sh(2401938)──tar(2402023)

                                  │                                        └─xz(2402020)

                                  ─initscripts:pat(2401783)───sh(2402167)

                                  ─libcap-ng:patch(2401757)───sh(2402173)

                                  ─libffi:patch(2401755)───sh(2402160)───xargs(2402177)───sed(2402181)

                                  ─libpam:unpack(2401745)

                                  ─libtirpc:patch(2401750)

                                  ─libtool-cross:p(2401719)

                                  ─libxcrypt:patch(2401725)

                                  ─libxml2-native:(2401734)

                                  ─linux-libc-head(2401701)───sh(2401818)──tar(2401824)

                                  │                                        └─xz(2401823)

                                  ─linux-yocto:ker(2400552)───run.do_kernel_c(2400801)

                                  ─lz4-native:comp(2400716)───run.do_compile.(2400804)───make(2400842)───make(2400866)───gcc-10(2400925)──as(2401679)

                                  │                                                                                                       └─cc1(2401678)

                                  ─ncurses:patch(2401729)───sh(2402156)───quilt(2402178)

                                  ─openssl:patch(2401816)

                                  ─opkg-utils:patc(2401721)

                                  ─python3-native:(2401712)───sh(2402155)──tar(2402180)

                                  │                                        └─xz(2402179)

                                  ─update-rc.d:pat(2401914)

                                  ─util-linux-libu(2401747)

                                  ─xz:patch(2401774)───sh(2402165)

                                  ─zlib:unpack(2401723)

                                  └─{Worker}(2381486)

 

Ross

 

From: richard.purdie@... <richard.purdie@...>
Date: Thursday, 19 May 2022 at 13:43
To: swat <swat@...>, openembedded-core <openembedded-core@...>
Cc: Luca Ceresoli <luca.ceresoli@...>, Alexandre Belloni <alexandre.belloni@...>, Ross Burton <Ross.Burton@...>
Subject: Investigating a hung job on the autobuilder - a HOWTO

Investigating a hung job on the autobuilder

We've yet another hung job on the autobuilder. I thought I'd write down
what I did to investigate it so others can learn how to do it and so I
can remember next time I need to do it too.

The problem was this job:

https://autobuilder.yoctoproject.org/typhoon/#/builders/97/builds/4563

qemuarm64-armhost running on an aarch64 worker, hung in do_testsdkext
for core-image-minimal. To get further you need ssh access. We do give
out ssh access to people who need it.

ssh to ubuntu1804-arm-1.yocto.io and then "sudo -iu pokybuild" lets us
have a look at what is going on. Easy first step is:

$ ps ax | grep armhost
28429 ?        S      0:00 python3 /home/pokybuild/yocto-worker/qemuarm64-armhost/yocto-autobuilder-helper/scripts/run-config qemuarm64-armhost /home/pokybuild/yocto-worker/qemuarm64-armhost/build/build lucaceresoli/master-next ssh://git@.../poky-contrib --sstateprefix  --buildappsrcrev  --publish-dir  --build-type full --workername ubuntu1804-arm-1 --build-url https://autobuilder.yoctoproject.org/typhoon/#builders/97/builds/4563 --results-dir /srv/autobuilder/autobuilder.yocto.io/pub/non-release/20220517-8/testresults --quietlogging --stepname test-targets --phase 2
28430 ?        S      0:00 /bin/sh -c . ./oe-init-build-env; /home/pokybuild/yocto-worker/qemuarm64-armhost/yocto-autobuilder-helper/scripts/checkvnc; DISPLAY=:1 bitbake core-image-minimal:do_testimage core-image-sato:do_testimage core-image-sato-sdk:do_testimage core-image-sato:do_testsdk  core-image-minimal:do_testsdkext core-image-sato:do_testsdkext -k
28460 ?        Sl     0:32 python3 /home/pokybuild/yocto-worker/qemuarm64-armhost/build/bitbake/bin/bitbake core-image-minimal:do_testimage core-image-sato:do_testimage core-image-sato-sdk:do_testimage core-image-sato:do_testsdk core-image-minimal:do_testsdkext core-image-sato:do_testsdkext -k
28462 ?        Sl     6:43 bitbake-server /home/pokybuild/yocto-worker/qemuarm64-armhost/build/bitbake/bin/bitbake-server decafbad 3 5 /home/pokybuild/yocto-worker/qemuarm64-armhost/build/build/bitbake-cookerdaemon.log /home/pokybuild/yocto-worker/qemuarm64-armhost/build/build/bitbake.lock /home/pokybuild/yocto-worker/qemuarm64-armhost/build/build/bitbake.sock 0 None 0
28504 ?        Sl     0:17 python3 /home/pokybuild/yocto-worker/qemuarm64-armhost/build/bitbake/bin/bitbake-worker decafbad
28515 ?        SNs    0:00 python3 /home/pokybuild/yocto-worker/qemuarm64-armhost/build/bitbake/bin/bitbake-worker decafbad
29659 pts/1    S+     0:00 grep armhost
38713 ?        SN     0:00 /bin/sh -c . /home/pokybuild/yocto-worker/qemuarm64-armhost/build/build/tmp/work/qemuarm64-poky-linux/core-image-minimal/1.0-r0/testsdkext/environment-setup-cortexa57-poky-linux > /dev/null; devtool sdk-install meta-extsdk-toolchain
38719 ?        SN     0:01 /home/pokybuild/yocto-worker/qemuarm64-armhost/build/build/tmp/work/qemuarm64-poky-linux/core-image-minimal/1.0-r0/testsdkext/buildtools/sysroots/aarch64-pokysdk-linux/usr/bin/python3 /home/pokybuild/yocto-worker/qemuarm64-armhost/build/build/tmp/work/qemuarm64-poky-linux/core-image-minimal/1.0-r0/testsdkext/sysroots/aarch64-pokysdk-linux/usr/bin/devtool sdk-install meta-extsdk-toolchain

i.e. looking for processes that are running in that build directory.

The relevant part of the process tree from "ps axjf":

 PPID   PID  PGID   SID TTY      TPGID STAT   UID   TIME COMMAND
    1  3501  3500  3500 ?           -1 Sl    6000   2:29 /usr/bin/python3 /usr/local/bin/buildbot-worker start /home/pokybuild/yocto-worker/
 3501 28429 28429  3500 ?           -1 S     6000   0:00  \_ python3 /home/pokybuild/yocto-worker/qemuarm64-armhost/yocto-autobuilder-helper/scripts/
28429 28430 28429  3500 ?           -1 S     6000   0:00      \_ /bin/sh -c . ./oe-init-build-env; /home/pokybuild/yocto-worker/qemuarm64-armhost/yoc
28430 28460 28429  3500 ?           -1 Sl    6000   0:32          \_ python3 /home/pokybuild/yocto-worker/qemuarm64-armhost/build/bitbake/bin/bitbake
    1 28462 28461 28461 ?           -1 Sl    6000   6:43 bitbake-server /home/pokybuild/yocto-worker/qemuarm64-armhost/build/bitbake/bin/bitbake-serv
28462 28504 28461 28461 ?           -1 Sl    6000   0:17  \_ python3 /home/pokybuild/yocto-worker/qemuarm64-armhost/build/bitbake/bin/bitbake-worker
28504 28515 28515 28515 ?           -1 SNs   6000   0:00      \_ python3 /home/pokybuild/yocto-worker/qemuarm64-armhost/build/bitbake/bin/bitbake-wor
28515 38713 28515 28515 ?           -1 SN    6000   0:00          \_ /bin/sh -c . /home/pokybuild/yocto-worker/qemuarm64-armhost/build/build/tmp/work
38713 38719 28515 28515 ?           -1 SN    6000   0:01              \_ /home/pokybuild/yocto-worker/qemuarm64-armhost/build/build/tmp/work/qemuarm6

so the hung process is 38719.

I did have a look at the log files in the build directory but in this
case I didn't find anything. The bitbake-cookerdaemon.log in particular
can sometimes help. I had a look at the open files for the process:

$ ls /proc/38719/fd -la
total 0
dr-x------ 2 pokybuild pokybuild  0 May 19 12:08 .
dr-xr-xr-x 9 pokybuild pokybuild  0 May 19 11:59 ..
lr-x------ 1 pokybuild pokybuild 64 May 19 12:08 0 -> /dev/null
l-wx------ 1 pokybuild pokybuild 64 May 19 12:08 1 -> 'pipe:[122175861]'
l-wx------ 1 pokybuild pokybuild 64 May 19 12:08 2 -> 'pipe:[122175861]'
lr-x------ 1 pokybuild pokybuild 64 May 19 12:08 4 -> 'pipe:[122140435]'

but nothing interesting there.

I couldn't use strace as the permissions weren't setup for that without
root. We don't normally give out that access but to debug this I went
ahead and looked with strace:

strace -p 38719
strace: Process 38719 attached
futex(0xffff9c2da2bc, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff

so it is stuck in a futex. I then went ahead and tried gdb:

gdb -p 38719

BFD: warning: /home/pokybuild/yocto-worker/qemuarm64-armhost/build/build/tmp/work/qemuarm64-poky-linux/core-image-minimal/1.0-r0/testsdkext/buildtools/sysroots/aarch64-pokysdk-linux/usr/bin/python3.10.real: unsupported GNU_PROPERTY_TYPE (5) type: 0xc0000000
BFD: warning: /home/pokybuild/yocto-worker/qemuarm64-armhost/build/build/tmp/work/qemuarm64-poky-linux/core-image-minimal/1.0-r0/testsdkext/buildtools/sysroots/aarch64-pokysdk-linux/lib/libc.so.6: unsupported GNU_PROPERTY_TYPE (5) type: 0xc0000000
Reading symbols from /home/pokybuild/yocto-worker/qemuarm64-armhost/build/build/tmp/work/qemuarm64-poky-linux/core-image-minimal/1.0-r0/testsdkext/buildtools/sysroots/aarch64-pokysdk-linux/lib/libc.so.6...(no debugging symbols found)...done.

(gdb) bt
Python Exception <class 'gdb.error'> Call Frame Instruction op 45 in vendor extension space is not handled on this architecture.:
#0  0x0000ffff9bd1e16c in ?? ()
   from /home/pokybuild/yocto-worker/qemuarm64-armhost/build/build/tmp/work/qemuarm64-poky-linux/core-image-minimal/1.0-r0/testsdkext/buildtools/sysroots/aarch64-pokysdk-linux/lib/libc.so.6
Call Frame Instruction op 45 in vendor extension space is not handled on this architecture.

I've abbreviated the initial output. We can see that it is using python
3.10 from within builtdools tarball within an eSDK and we have no debug
symbols. The lack of a backtrace was worrying, after consulting some
ARM experts it was concluded the version of gdb was too old for the
newer glibc in buildtools.

After much thinking, I decided to enable and build a gdb-native. I did
something horrible and moved the lock and sock files out the way and
then a "bitbake gdb-native" in the same build directory. We don't have
a gdb-native recipe however adding a BBCLASSEXTEND = "native", a
PACKAGECONFIG to enable python in the native case and disable LLTNGUST
for native. A "bitbake gdb-native -c addto_recipe_sysroot" made it
available in the gdb-native WORKDIR where I could then run it from:

/home/pokybuild/yocto-worker/qemuarm64-armhost/build/build/tmp/work/aarch64-linux/gdb-native/12.1-r0/recipe-sysroot-native/usr/bin/gdb -p 3871

I did try using gdb-cross-aarch64 but that wouldn't attach to local
processes.

Sadly there were no symbols so the backtrace output was still not
useful other than seeing we were in some libpthread function
(unsurprisingly). I realised I could go into the python3-nativesdk
workdir, find the nativesdk-python3-dbg package (I used the deb) and
then copy the .debug files from that into the
testsdkext/buildtools/sysroots/aarch64-pokysdk-linux/usr/lib/ directory
for the python libs.

With that, I got some more interesting output:

(gdb) bt
#0  0x0000ffff9bd1e16c in ?? ()
   from /home/pokybuild/yocto-worker/qemuarm64-armhost/build/build/tmp/work/qemuarm64-poky-linux/core-image-minimal/1.0-r0/testsdkext/buildtools/sysroots/aarch64-pokysdk-linux/lib/libc.so.6
#1  0x0000ffff9bd20d88 in pthread_cond_wait ()
   from /home/pokybuild/yocto-worker/qemuarm64-armhost/build/build/tmp/work/qemuarm64-poky-linux/core-image-minimal/1.0-r0/testsdkext/buildtools/sysroots/aarch64-pokysdk-linux/lib/libc.so.6
#2  0x0000ffff9c0db3a8 in drop_gil (ceval=ceval@entry=0xffff9c2da208 <_PyRuntime+344>, ceval2=ceval2@entry=0xaaab09af3f20,
    tstate=tstate@entry=0xaaab09ab9ba0) at ../Python-3.10.4/Python/ceval_gil.h:182
#3  0x0000ffff9bfc8708 in eval_frame_handle_pending (tstate=0xaaab09ab9ba0) at ../Python-3.10.4/Python/ceval.c:1185
#4  _PyEval_EvalFrameDefault (tstate=0xaaab09ab9ba0, f=0xffff9445edd0, throwflag=<optimized out>) at ../Python-3.10.4/Python/ceval.c:1775
#5  0x0000ffff9c0de0a8 in _PyEval_EvalFrame (throwflag=0, f=0xffff9445edd0, tstate=0xaaab09ab9ba0)
    at ../Python-3.10.4/Include/internal/pycore_ceval.h:46
#6  _PyEval_Vector (tstate=0xaaab09ab9ba0, con=0xffff95a20050, locals=<optimized out>, args=<optimized out>, argcount=<optimized out>,
    kwnames=<optimized out>) at ../Python-3.10.4/Python/ceval.c:5065
#7  0x0000ffff9c13a35c in _PyObject_VectorcallTstate (nargsf=9223372036854775809, kwnames=0x0, args=0xffffd3ebd780, callable=0xffff95a20040,
    tstate=0xaaab09ab9ba0) at ../Python-3.10.4/Include/cpython/abstract.h:114
#8  PyObject_CallOneArg (arg=0xffff95ac8130, func=0xffff95a20040) at ../Python-3.10.4/Include/cpython/abstract.h:184
#9  handle_weakrefs (old=0xaaab09af4190, unreachable=0xffffd3ebd748) at ../Python-3.10.4/Modules/gcmodule.c:887
#10 gc_collect_main (tstate=tstate@entry=0xaaab09ab9ba0, generation=generation@entry=2, n_collected=n_collected@entry=0xffffd3ebd800,
    n_uncollectable=n_uncollectable@entry=0xffffd3ebd808, nofail=nofail@entry=0) at ../Python-3.10.4/Modules/gcmodule.c:1281
#11 0x0000ffff9c13abfc in gc_collect_with_callback (tstate=tstate@entry=0xaaab09ab9ba0, generation=generation@entry=2)
    at ../Python-3.10.4/Modules/gcmodule.c:1413
#12 0x0000ffff9c13b1e4 in PyGC_Collect () at ../Python-3.10.4/Modules/gcmodule.c:2099
#13 0x0000ffff9c117538 in Py_FinalizeEx () at ../Python-3.10.4/Python/pylifecycle.c:1781
#14 Py_FinalizeEx () at ../Python-3.10.4/Python/pylifecycle.c:1703
#15 0x0000ffff9c1181e8 in Py_Exit (sts=0) at ../Python-3.10.4/Python/pylifecycle.c:2858
#16 0x0000ffff9c11cbc0 in handle_system_exit () at ../Python-3.10.4/Python/pythonrun.c:775
#17 _PyErr_PrintEx (tstate=0xaaab09ab9ba0, set_sys_last_vars=set_sys_last_vars@entry=1) at ../Python-3.10.4/Python/pythonrun.c:785
#18 0x0000ffff9c11cbfc in PyErr_PrintEx (set_sys_last_vars=set_sys_last_vars@entry=1) at ../Python-3.10.4/Python/pythonrun.c:880
#19 0x0000ffff9c11cc0c in PyErr_Print () at ../Python-3.10.4/Python/pythonrun.c:886
#20 0x0000ffff9c11d280 in _PyRun_SimpleFileObject (fp=fp@entry=0xaaab09ab2540, filename=filename@entry=0xffff95cbe6b0, closeit=closeit@entry=1,
    flags=flags@entry=0xffffd3ebda48) at ../Python-3.10.4/Python/pythonrun.c:462
#21 0x0000ffff9c11d50c in _PyRun_AnyFileObject (fp=fp@entry=0xaaab09ab2540, filename=filename@entry=0xffff95cbe6b0, closeit=closeit@entry=1,
    flags=flags@entry=0xffffd3ebda48) at ../Python-3.10.4/Python/pythonrun.c:90
#22 0x0000ffff9c1389c0 in pymain_run_file_obj (skip_source_first_line=0, filename=0xffff95cbe6b0, program_name=0xffff95dce3d0)
    at ../Python-3.10.4/Modules/main.c:353
#23 pymain_run_file (config=0xaaab09af4290) at ../Python-3.10.4/Modules/main.c:372
#24 pymain_run_python (exitcode=0xffffd3ebda44) at ../Python-3.10.4/Modules/main.c:587
#25 Py_RunMain () at ../Python-3.10.4/Modules/main.c:666
#26 0x0000ffff9c138e6c in pymain_main (args=0xffffd3ebdb20) at ../Python-3.10.4/Modules/main.c:696
#27 Py_BytesMain (argc=<optimized out>, argv=<optimized out>) at ../Python-3.10.4/Modules/main.c:720
#28 0x0000ffff9bccb234 in ?? ()
   from /home/pokybuild/yocto-worker/qemuarm64-armhost/build/build/tmp/work/qemuarm64-poky-linux/core-image-minimal/1.0-r0/testsdkext/buildtools/sysroots/aarch64-pokysdk-linux/lib/libc.so.6
#29 0x0000ffff9bccb30c in __libc_start_main ()
   from /home/pokybuild/yocto-worker/qemuarm64-armhost/build/build/tmp/work/qemuarm64-poky-linux/core-image-minimal/1.0-r0/testsdkext/buildtools/sysroots/aarch64-pokysdk-linux/lib/libc.so.6
#30 0x0000aaaaca8cd9b0 in _start ()

I did try "py-bt" but that wasn't present. This can be fixed by
ensuring the source for python3-native is present ("bitbake python3-
native -c patch"), then

source build/tmp/work/aarch64-linux/python3-native/3.10.4-r0/Python-3.10.4/Tools/gdb/libpython.py

within gdb will add the python extensions. We can then get:

(gdb) py-bt
Traceback (most recent call first):
  File "/home/pokybuild/yocto-worker/qemuarm64-armhost/build/build/tmp/work/qemuarm64-poky-linux/core-image-minimal/1.0-r0/testsdkext/buildtools/sysroots/aarch64-pokysdk-linux/usr/lib/python3.10/_weakrefset.py", line 39, in _remove
    def _remove(item, selfref=ref(self)):
  Garbage-collecting

i.e. it is in python garbage collection. This probably means a
multiprocessing.Lock() or threading.Lock() was left in the locked state
by some other process/thread which exited and this has hung the build.

I still haven't worked out how to know *which* lock this is or how to
debug any further but I was quite pleased to even be able to debug this
to this point.

If anyone does have any further tips on this it is currently still
running in case we can figure out which lock it is/was.

Cheers,

Richard

IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.

Join swat@lists.yoctoproject.org to automatically receive all group messages.