Re: Yocto SWAT team kickoff


Flanagan, Elizabeth <elizabeth.flanagan@...>
 

I've done some initial wikification of this:

https://wiki.yoctoproject.org/wiki/Yocto_Build_Failure_Swat_Team#Live_Debugging_Process

Michael Halstead, as the SA, should probably be included on this, as
access rights to the infrastructure should, in most cases, go through
him.

-b

On Tue, Jan 10, 2012 at 7:24 AM, Liu, Song <song.liu@...> wrote:
Hi all,

We would like to kick off the Yocto SWAT team this week. Please see the following for the purpose of the SWAT team and let me know if you have any questions or concerns. We welcome any community participation on the SWAT team. At the same time, I will work with the team to make sure things get started.

Thanks,
Song

YOCTO SWAT TEAM

GOAL

The Yocto Project SWAT team is being assembled mainly to tackle, in a timely manner, urgent technical problems that break the build on the master branch or major release branches, and thereby to maintain the stability of the master and release branches. The SWAT team includes volunteers or appointed members of the Yocto Project team. Community members can also volunteer to be part of the SWAT team.

SCOPE OF RESPONSIBILITY

Whenever a build (nightly build, weekly build, release build) fails, the SWAT team is responsible for ensuring the necessary debugging occurs and organizing resources to solve the issue and ensure successful builds. If resolving the issues requires schedule or resource adjustment, the SWAT team should work with program and development management to accommodate the change in the overall planning.

MEMBERS:

* Darren Hart (US)
* Elizabeth Flanagan (US)
* Paul Eggleton (UK)
* Jessica Zhang (US)
* Dexuan Cui (CN)
* Saul Wold (US)
* Richard Purdie (UK)

ROTATING CHAIR:

The Chairperson role will be rotated among team members each week. The Chairperson should monitor the build status for the entire week. Whenever a build is broken, the Chairperson should do the necessary debugging and organize resources to solve the problems in a timely manner to meet the overall project and release schedule. The Chairperson serves as the focal point of the SWAT team for external people such as program managers or development managers.

ROTATING PROCESS

Each week on a specific day (Monday is proposed), a SWAT team meeting could be called at the Chairperson's discretion to discuss current issues and status. Either during the meeting or offline, the previous week's Chairperson will identify and pass the role to another person in the team. The program manager should be notified at the same time. Usually, this will follow a simple round-robin order. If the next person cannot take the role due to a tight schedule, vacation or some other reason, the role will be passed to the person after them.
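
The round-robin hand-off described above can be sketched as follows. This is only an illustration: the function name next_chair is hypothetical, and it assumes the rotation follows the order of the MEMBERS list, which the proposal does not actually specify.

```python
from typing import Collection, Sequence

# Assumption: rotation order is the order of the MEMBERS list above.
MEMBERS = [
    "Darren Hart",
    "Elizabeth Flanagan",
    "Paul Eggleton",
    "Jessica Zhang",
    "Dexuan Cui",
    "Saul Wold",
    "Richard Purdie",
]


def next_chair(members: Sequence[str], current: str,
               unavailable: Collection[str] = ()) -> str:
    """Return the next available member after `current`, wrapping around.

    Members listed in `unavailable` (tight schedule, vacation, ...) are
    skipped. The current chair is only chosen again if everyone else is
    unavailable.
    """
    start = members.index(current)
    for offset in range(1, len(members) + 1):
        candidate = members[(start + offset) % len(members)]
        if candidate not in unavailable:
            return candidate
    raise ValueError("no member is available to take the chair")
```

For example, next_chair(MEMBERS, "Richard Purdie") wraps around to the first name in the list, and passing unavailable={"Elizabeth Flanagan"} with "Darren Hart" as the current chair skips ahead to "Paul Eggleton".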

The current Chairperson's full name and email address will be published on the project status wiki page: https://wiki.yoctoproject.org/wiki/Yocto_Project_v1.2_Status under the "Current SWAT team Chairperson" section.

BKM (RICHARD PURDIE)

When looking at a failure, the first question is what the baseline was and what changed. If there were recent known good builds, it helps to narrow down the number of changes that were likely responsible for the failure. It's also useful to note whether the build was from scratch or from existing sstate files. You can tell by seeing which "setscene" tasks run in the log.
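
As a rough aid for that last check, the sketch below pulls out the log lines that mention "setscene". It assumes only that such tasks appear in the log with the word "setscene" in the line; the helper name and the sample log lines in the usage note are made up for illustration and are not taken from a real autobuilder log.

```python
from typing import Iterable, List


def setscene_lines(log_lines: Iterable[str]) -> List[str]:
    """Return the log lines that mention a setscene task.

    Assumption: a plain substring match on "setscene" is enough to spot
    these tasks. An empty result suggests, but does not prove, that the
    build ran from scratch rather than from existing sstate files.
    """
    return [line.rstrip("\n") for line in log_lines if "setscene" in line]
```

With a hypothetical log containing one line such as "NOTE: Running setscene task 1 of 2 (foo.bb, do_populate_sysroot_setscene)" and one ordinary do_compile line, the function returns only the first. On a real log you would pass an open file object, which iterates line by line.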

The primary responsibility is to ensure that any failures are categorized correctly and that the right people get to know about them.

It's important that *someone* is then tasked with fixing it. Image failures are particularly tricky since it's likely that some component of the image failed, and the question is then whether that component changed recently, whether some kind of core functionality was at fault, and so on.

Ideally we want to get the failure reported to the person who knows something about the area and can come up with a fix without it distracting them too much.

As a secondary responsibility, it's often helpful to triage the failure. This might mean documenting a way to reproduce the failure outside a full build and/or documenting how the failure is happening, and maybe even proposing a fix.

Sometimes failures are difficult to understand and can require direct ssh access to the autobuilder so the issue can be debugged passively on the system, examining the contents of files and so forth. If doing this, ensure you don't change anything on the file system, for example by adding files that the autobuilder couldn't then delete when it rebuilds.

Rarely, "live" debugging might be needed, where you'd su to the pokybuild user and run a build manually to see the failure in real time. If doing this, ensure you only create files as the pokybuild user and that you are careful not to generate sstate packages which shouldn't be present or any other bad state that might get reused. In general, it's recommended not to do "live" debugging. This can be escalated to RP/Saul/Beth if needed.
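
One way to double-check the "only create files as the pokybuild user" rule after a debugging session is a read-only scan for files owned by anyone else. The following is a minimal sketch, not an established autobuilder tool: the function name is hypothetical, and the directory to scan is whatever build directory you were working in.

```python
import os
from typing import Iterator


def files_not_owned_by(root: str, expected_uid: int) -> Iterator[str]:
    """Yield paths under `root` whose owner is not `expected_uid`.

    Read-only: uses lstat so symlinks are reported by their own owner
    and are never followed.
    """
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if os.lstat(path).st_uid != expected_uid:
                yield path
```

On the autobuilder the expected uid could be looked up with pwd.getpwnam("pokybuild").pw_uid from the standard library; anything the scan reports is a candidate for a file the autobuilder may not be able to delete on its next rebuild.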

To fulfill the primary responsibility, it's suggested that bugs are opened on the bugzilla for each type of failure. This way, appropriate people can be brought into the discussion and a specific owner of the failure can be assigned. Replying to the build failure with the bug ID and also bringing the bug to the attention of anyone you suspect was responsible for the problem are also good practices.

_______________________________________________
yocto mailing list
yocto@...
https://lists.yoctoproject.org/listinfo/yocto
--
Elizabeth Flanagan
Yocto Project
Build and Release
