Video of this presentation from Release Engineering work week in Portland, 29 April 2014
Part 3: Keeping the devices running
So in Parts 1 and 2, we saw how the Buildbot tegra and panda masters assign jobs to Buildbot slaves, that these slaves run on foopies, and that the foopies then connect to the SUT Agent on the device to deploy and run the tests and pull back results.
However, since these devices can fail over time, how do we make sure they are running OK, and handle the case where they go AWOL?
The answer has two parts: watch_devices.sh and mozpool.
What is watch_devices.sh?
You remember that in Part 2, we said you need to create a directory under /builds on the foopy for any device that foopy should be taking care of.
This script looks for device directories under /builds to see which devices are associated with this foopy. For each of these, it checks that a buildbot slave is running for that device, and starts one automatically if it is not. It also checks the health of the device, using the verification tools of SUT tools (discussed in Part 2). If it finds a problem with a device, it shuts down the buildbot slave so that the device does not get new jobs. In short, it keeps the state of the buildbot slave consistent with what it believes the availability of the device to be: if the device is faulty, it brings down the buildbot slave for that device; if the device is healthy and passes the verification tests, it starts the buildbot slave if it is not already running.
It also checks the “disabled” state of the device in slavealloc, and if the device is “disabled” there, it makes sure the buildbot slave is shut down.
Therefore, if you need to disable a device, mark it as disabled in slavealloc; watch_devices.sh, which runs from a crontab on the foopy, will then bring down the buildbot slave for that device.
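The reconciliation described above can be sketched roughly as follows. This is only an illustration in Python, not the actual watch_devices.sh logic: the DeviceStatus fields and the desired_action function are hypothetical names, and the real script gathers this information by talking to slavealloc and the device.

```python
from dataclasses import dataclass


@dataclass
class DeviceStatus:
    """What the watcher knows about one device (hypothetical fields)."""
    disabled_in_slavealloc: bool
    passes_verification: bool
    slave_running: bool


def desired_action(status: DeviceStatus) -> str:
    """Return 'start', 'stop' or 'none' for the device's buildbot slave."""
    # A device is available only if it is enabled and verifies cleanly.
    available = (not status.disabled_in_slavealloc
                 and status.passes_verification)
    if available and not status.slave_running:
        return "start"
    if not available and status.slave_running:
        return "stop"
    # The slave state already matches the device availability.
    return "none"
```

The point of writing it this way is that the slave state is always derived from the device state, never the other way round.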
Where are the log files of watch_devices.sh?
They are on the foopy:
- /builds/watcher.log (global)
- /builds/<device>/watcher.log (per device)
If during a buildbot test we determine that a device is not behaving properly, how do we pull it out of use?
If a serious problem is found with a device during a buildbot job, the buildbot job will create an error.flg file under the device directory on the foopy. This signals to watch_devices.sh that, once that job has completed, it should kill the buildbot slave because the device is faulty, and that it should not respawn a buildbot slave while the error.flg file remains. Once per hour, watch_devices.sh deletes the error.flg file to force another verification test of the device.
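As a rough illustration of the error.flg handling, here is a minimal Python sketch. It assumes that "once per hour" means the flag is removed once it is at least an hour old, which is one possible reading; the function name and the use of the file's modification time are assumptions, not taken from watch_devices.sh.

```python
import os
import time

# "Once per hour", interpreted here as a maximum flag age (an assumption).
ERROR_FLAG_MAX_AGE_SECONDS = 60 * 60


def error_flag_blocks_slave(device_dir, now=None):
    """Return True while error.flg should keep the buildbot slave down.

    A flag that is at least an hour old is deleted, so that the next
    watcher run performs a fresh verification of the device.
    """
    flag = os.path.join(device_dir, "error.flg")
    if not os.path.exists(flag):
        return False
    now = time.time() if now is None else now
    if now - os.path.getmtime(flag) >= ERROR_FLAG_MAX_AGE_SECONDS:
        os.remove(flag)
        return False
    return True
```

A watcher run would call this with the device directory, for example error_flag_blocks_slave("/builds/<device>"), before deciding whether to respawn the slave.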
But wait, I heard that mozpool verifies devices and keeps them alive?
Yes and no. Mozpool is a tool (written by Dustin) to take care of the life-cycle management of panda boards. It does not manage tegras. Remember: tegras cannot be automatically reimaged – you need fingers to press buttons on the devices, and physically connect a laptop to them. Pandas can. This is why mozpool only takes care of pandas.
Mozpool is made up of three layered components. From the mozpool overview (http://mobile-imaging-001.p1.releng.scl1.mozilla.com/ui/):
- Mozpool is the highest-level interface, where users request a device in a certain condition, and Mozpool finds a suitable device.
- Lifeguard is the middle level. It manages the state of devices, and knows how to cajole and coddle them to achieve reliable behavior.
- Black Mobile Magic is the lowest level. It deals with devices directly, including controlling their power and PXE booting them. Be careful using this level!
So the principle behind mozpool is that all the logic around getting a panda board (making sure it is clean and ready to use, that it contains the OS image you want to run, and so on) can be handled outside of the buildbot jobs. You just query mozpool, tell it you’d like a device, specify the operating system image you want, and it will get you one.
In the background it is monitoring the devices and checking they are ok, only handing you a “good” device, and cleaning up when you finish with it.
So watch_devices and mozpool are both routinely running verification tests against the pandas?
No. This used to be the case, but now the verification step of watch_devices.sh for pandas simply queries mozpool for the status of the device. It no longer runs verification tests directly against the panda, so that we do not have two systems doing the same job. It trusts mozpool to report the correct state.
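The split in responsibilities can be pictured with a small Python sketch. Everything named here is a stand-in: the "panda-" name prefix, the "ready" state string, and the two callables passed in are assumptions for illustration, not the real mozpool API or the real SUT verification tools.

```python
from typing import Callable


def device_is_healthy(
    device: str,
    mozpool_state: Callable[[str], str],
    run_sut_verify: Callable[[str], bool],
) -> bool:
    """Decide device health the way the text describes.

    Pandas: trust whatever state mozpool reports (hypothetical lookup).
    Anything else (tegras): run the verification tests directly.
    """
    if device.startswith("panda-"):  # naming convention is an assumption
        return mozpool_state(device) == "ready"  # state name is an assumption
    return run_sut_verify(device)
```

Passing the two checks in as functions keeps the sketch independent of how mozpool or the device is actually contacted.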
So if I dynamically get a device from mozpool when I ask for one, does that mean my buildbot slave might get different devices at different times, depending on which devices are currently available and working at the time of the request?
No. Since the name of the buildbot slave is the same as the name of the device, the buildbot slave is bound to that one device only. This means it cannot take advantage of the “give me a panda with this image, I don’t care which one” model.
Summary part 3
So we’ve learned:
- there is a cron job running on the foopies that looks for the device directories under /builds and spawns/kills buildbot slaves as appropriate, so that the state of the buildbot slave matches the availability of the device
- mozpool is a tool for automatically reimaging pandas
- not all features of mozpool are available due to our buildbot setup (such as being able to get an arbitrary panda dynamically at runtime for a given buildbot slave)