Help triaging git repo corruption caused by bb fetch


Sean McKay
 

Hi all,

 

Short version:

We have an intermittent issue where, under certain circumstances, a build that fails on a fetch task may leave the parent repo corrupted. I’m wondering whether anyone has seen anything like this before, or has suggestions on where to look in the logs for potential culprits.

 

Long version (read the short version first):

Summary of what I know:

  • We first started running into this a few years ago on krogoth (2.1).
  • At the time, the team that (internally) owned poky did some investigation and put a patch into our copy of bitbake (see below) that seemed to address the issue. As far as I know, they did not attempt to upstream the patch at the time the issue was found.
  • As we’re currently in the middle of upgrading to Warrior, we found the patch and tried to determine whether it was still relevant. After a week or so of attempts (which included talking to some of the engineers who did the original triage), we concluded that we couldn’t hit the issue and removed the patch.
  • A few days ago, while working on our poky upgrade branch, I hit the issue again (once) while dealing with a proxy failure (the proxy was set incorrectly, so all outside fetch jobs would fail).
  • I have (as yet) been unable to reproduce the issue. I cloned a new instance of our repo and wrote a quick script to rerun the commands that originally resulted in the corruption. After a few days and ~10k iterations, I concluded that there’s probably an external factor involving the network connections and gave up trying to reproduce it that way.
  • The major symptoms are:
    • All branches in the repo are gone. They’ve been replaced by refs for the branches of the repo that bitbake was trying to fetch.
    • Remotes from the repo being fetched have overwritten any remotes with matching names in the local repo (typically this is just origin).
    • All the objects from the repo being fetched have been merged with the objects from the existing repo (both sets of object files exist in the .git directory).
  • This is the patch that appears to have kept our users from encountering this problem over the last few years:

diff --git a/yocto/poky/bitbake/lib/bb/fetch2/git.py b/yocto/poky/bitbake/lib/bb/fetch2/git.py
index 4de88f5ab91..457aae69420 100644
--- a/yocto/poky/bitbake/lib/bb/fetch2/git.py
+++ b/yocto/poky/bitbake/lib/bb/fetch2/git.py
@@ -186,9 +186,19 @@ class Git(FetchMethod):
 
         # If the checkout doesn't exist and the mirror tarball does, extract it
         if not os.path.exists(ud.clonedir) and os.path.exists(ud.fullmirror):
+            original_path = os.getcwd()
             bb.utils.mkdirhier(ud.clonedir)
             os.chdir(ud.clonedir)
-            runfetchcmd("tar -xzf %s" % (ud.fullmirror), d)
+            try:
+                runfetchcmd("tar -xzf %s" % (ud.fullmirror), d)
+            except bb.fetch2.FetchError as e:
+                logger.debug(1, "Error while extracting tarball: %s" % e.message)
+                # If tar fails, then remove the clonedir directory as the
+                # extracted repo is not complete.
+                os.chdir(original_path)
+                bb.utils.remove(ud.clonedir, True)
+                raise bb.fetch2.FetchError(
+                    "Unable to extract %s into %s" % (ud.fullmirror, ud.clonedir))

  • As best I can figure, the patch above probably fixes our issue only as a side effect, since none of that code touches the refs in the repo.
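
For anyone who wants to experiment with the cleanup-on-failure idea outside of bitbake, here is a minimal stand-alone sketch of the same pattern. It is not bitbake code: extract_or_cleanup is a hypothetical helper, and it uses Python's tarfile module instead of runfetchcmd. It also deliberately avoids os.chdir(), which is an assumption on my part about what a safer variant could look like, not something the original patch does.

```python
import os
import shutil
import tarfile


def extract_or_cleanup(tarball, dest):
    """Extract a .tar.gz into dest; remove dest entirely if extraction fails.

    Hypothetical helper for illustration only (not part of bitbake).
    It never changes the process working directory, so a failure cannot
    leave later commands running in an unexpected location.
    """
    os.makedirs(dest, exist_ok=True)
    try:
        with tarfile.open(tarball, "r:gz") as tar:
            tar.extractall(dest)
    except Exception as exc:
        # A partially extracted directory is not a usable clone, so drop it
        # and report the failure to the caller.
        shutil.rmtree(dest, ignore_errors=True)
        raise RuntimeError(
            "Unable to extract %s into %s" % (tarball, dest)) from exc
```

The point of the sketch is only that a failed extraction leaves no half-populated directory behind; whether that is what actually prevents the corruption we saw is still an open question.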

 

Questions:

  • I still have the broken repo saved. Are there suggestions for where I might look in the logs and in ${TMPDIR} to figure out exactly what happened, so I can reproduce it more reliably?
  • Does anyone know of others who have experienced similar issues? Perhaps the circumstances surrounding their issues could point us in the right direction. Are there mailing list threads I should look at?
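
As a starting point for inspecting the saved repo, something like the following sketch could dump what is in it. It assumes git's default file-based ref storage (loose files under refs/ plus packed-refs) and that the reflog at logs/HEAD exists, which may not hold for every repo. The function names are made up for this example.

```python
import os


def list_refs(git_dir):
    """Return {refname: value} from packed-refs and loose ref files.

    Assumes the default file-based ref storage. Symbolic refs (for example
    refs/remotes/origin/HEAD) are returned as their raw "ref: ..." text.
    """
    refs = {}
    packed = os.path.join(git_dir, "packed-refs")
    if os.path.isfile(packed):
        with open(packed) as fh:
            for line in fh:
                line = line.strip()
                # Skip the header comment and peeled-tag lines ("^<sha>").
                if not line or line.startswith(("#", "^")):
                    continue
                sha, name = line.split(" ", 1)
                refs[name] = sha
    for dirpath, _dirs, files in os.walk(os.path.join(git_dir, "refs")):
        for fname in files:
            path = os.path.join(dirpath, fname)
            name = os.path.relpath(path, git_dir).replace(os.sep, "/")
            with open(path) as fh:
                # Loose refs take precedence over packed-refs entries.
                refs[name] = fh.read().strip()
    return refs


def reflog_tail(git_dir, count=20):
    """Return the last `count` lines of logs/HEAD, or [] if it is missing."""
    path = os.path.join(git_dir, "logs", "HEAD")
    if not os.path.isfile(path):
        return []
    with open(path, errors="replace") as fh:
        return fh.read().splitlines()[-count:]
```

Running list_refs() against the .git directory of the broken repo and of a fresh clone, then comparing the two results, should show which refs were replaced. The reflog lines carry timestamps that might be matched against the bitbake logs from the failed build.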

 

Thank you very much!

-Sean McKay
