How we do automated mobile device testing at Mozilla – Part 3

Video of this presentation from Release Engineering work week in Portland, 29 April 2014

Part 3: Keeping the devices running

So in Parts 1 and 2, we saw how the Buildbot tegra and panda masters assign jobs to Buildbot slaves, that these slaves run on foopies, and that the foopies connect to the SUT Agent on the device to deploy and run the tests, and pull back results.

However, these devices can fail over time, so how do we make sure they are running OK, and how do we handle the case where they go AWOL?

The answer has two parts:

  1. watch_devices.sh
  2. mozpool

What is watch_devices.sh?

You remember that in Part 2, we said you need to create a directory under /builds on the foopy for any device that foopy should be taking care of.

Well there is a cron job installed under /etc/cron.d/foopy that takes care of running watch_devices.sh every 5 mins.

This script looks for device directories under /builds to see which devices are associated with this foopy. For each of these, it checks whether a buildbot slave is running for that device. It automatically starts buildbot slaves as necessary if they are not running, but it also checks the health of the device, using the verification tools from SUT tools (discussed in Part 2). If it finds a problem with a device, it also shuts down the buildbot slave, so that the slave does not take new jobs. In short, it keeps the state of the buildbot slave consistent with what it believes the availability of the device to be: if the device is faulty, it brings down the buildbot slave for that device; if the device is healthy and passes the verification tests, it starts up the buildbot slave if it is not already running.

It also checks the “disabled” state of the device in slavealloc, and makes sure that if the device is “disabled” in slavealloc, the buildbot slave is shut down.

Therefore, if you need to disable a device, you mark it as disabled in slavealloc, and watch_devices.sh, running from the cron job on the foopy, will bring down the buildbot slave for that device.
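
To make that behaviour concrete, here is a minimal Python sketch of the decision logic applied on each pass (the real watch_devices.sh is a shell script, and helper names like is_disabled_in_slavealloc and passes_verification below are hypothetical stand-ins for the real checks):

import glob
import os

def is_disabled_in_slavealloc(device):
    # Hypothetical: the real check asks slavealloc whether this slave is disabled.
    return False

def passes_verification(device):
    # Hypothetical: the real check runs the SUT tools verification (see Part 2),
    # or, for pandas, asks mozpool for the device status.
    return True

def buildbot_slave_running(device_dir):
    # Hypothetical: the real check looks for a live buildbot slave process.
    return os.path.exists(os.path.join(device_dir, "twistd.pid"))

def watch_devices():
    # One pass of the logic that runs from cron every 5 minutes.
    for device_dir in sorted(glob.glob("/builds/tegra-*") + glob.glob("/builds/panda-*")):
        device = os.path.basename(device_dir)
        faulty = os.path.exists(os.path.join(device_dir, "error.flg"))
        healthy = not faulty and not is_disabled_in_slavealloc(device) and passes_verification(device)
        if healthy and not buildbot_slave_running(device_dir):
            print("starting buildbot slave for", device)
        elif not healthy and buildbot_slave_running(device_dir):
            print("stopping buildbot slave for", device)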

Where are the log files of watch_devices.sh?

They are on the foopy:

  • /builds/watcher.log (global)
  • /builds/<device>/watcher.log (per device)

If during a buildbot test we determine that a device is not behaving properly, how do we pull it out of use?

If a serious problem is found with a device during a buildbot job, the buildbot job will create an error.flg file under the device directory on the foopy. This signals to watch_devices.sh that when that job has completed, it should kill the buildbot slave, since the device is faulty. It should not respawn a buildbot slave while that error.flg file remains. Once per hour, it will delete the error.flg file, to force another verification test of the device.
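
A rough sketch of that flag handling, assuming the flag file just holds a human-readable reason and that the hourly expiry is based on the file’s age (both assumptions on my part):

import os
import time

ERROR_FLAG_MAX_AGE = 60 * 60  # roughly one hour, in seconds

def flag_device_error(device_dir, reason):
    # Written by the buildbot job when it decides the device is faulty.
    with open(os.path.join(device_dir, "error.flg"), "w") as f:
        f.write(reason + "\n")

def clear_stale_error_flag(device_dir):
    # Done periodically on the foopy, so the device gets re-verified about once an hour.
    flag = os.path.join(device_dir, "error.flg")
    if os.path.exists(flag) and time.time() - os.path.getmtime(flag) > ERROR_FLAG_MAX_AGE:
        os.remove(flag)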

But wait, I heard that mozpool verifies devices and keeps them alive?

Yes and no. Mozpool is a tool (written by Dustin) to take care of the life-cycle management of panda boards. It does not manage tegras. Remember: tegras cannot be automatically reimaged – you need fingers to press buttons on the devices, and physically connect a laptop to them. Pandas can. This is why mozpool only takes care of pandas.

Mozpool is made up of three layered components. From the mozpool overview (http://mobile-imaging-001.p1.releng.scl1.mozilla.com/ui/):

  1. Mozpool is the highest-level interface, where users request a device in a certain condition, and Mozpool finds a suitable device.
  2. Lifeguard is the middle level. It manages the state of devices, and knows how to cajole and coddle them to achieve reliable behavior.
  3. Black Mobile Magic is the lowest level. It deals with devices directly, including controlling their power and PXE booting them. Be careful using this level!

So the principle behind mozpool is that all the logic around getting a panda board – making sure it is clean and ready to use, contains the right OS image you want to run, etc. – can be handled outside of the buildbot jobs. You just query mozpool, tell it you’d like a device, specify the operating system image you want, and it will get you one.

In the background it monitors the devices and checks they are OK, only handing you a “good” device, and cleaning up when you are finished with it.
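
As an illustration of that model, a request might look something like this in Python – the endpoint path and request fields here are my assumptions for the sake of the example, not the exact mozpool API; only the base URL (from the overview link above) is real:

import json
from urllib.request import Request, urlopen

MOZPOOL = "http://mobile-imaging-001.p1.releng.scl1.mozilla.com"

def request_any_panda(image, assignee, duration=3600):
    # Ask mozpool for *some* suitable panda, already imaged as requested.
    # The path and body fields are illustrative assumptions.
    body = json.dumps({"image": image,
                       "assignee": assignee,
                       "duration": duration}).encode()
    req = Request(MOZPOOL + "/api/device/any/request/", data=body,
                  headers={"Content-Type": "application/json"})
    with urlopen(req) as resp:
        return json.loads(resp.read())

# e.g. request_any_panda("android", "pmoore@mozilla.com")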

So watch_devices and mozpool are both routinely running verification tests against the pandas?

No. This used to be the case, but now the verification step watch_devices.sh performs for pandas simply queries mozpool for the status of the device. It no longer runs verification tests directly against the panda, to avoid having two systems doing the same job. It trusts mozpool to report the correct state.

So if I dynamically get a device from mozpool when I ask for one, does that mean my buildbot slave might get different devices at different times, depending on which devices are currently available and working at the time of the request?

No. Since the name of the buildbot slave is the same as the name of the device, the buildbot slave is bound to that one device only. This means it cannot take advantage of the “give me a panda with this image, I don’t care which one” model.

Summary part 3

So we’ve learned:

  • there is a cron job running on the foopies, that looks for the device directories under /builds, and spawns/kills buildbot slaves as appropriate, so that the state of the buildbot slave matches the availability of the device
  • mozpool is a tool for automatically reimaging pandas
  • not all features of mozpool are available due to our buildbot setup (such as being able to get an arbitrary panda dynamically at runtime for a given buildbot slave)


How we do automated mobile device testing at Mozilla – Part 2

Video of this presentation from Release Engineering work week in Portland, 29 April 2014

Part 2: The foopy, Buildbot slaves, and SUT tools

So how does buildbot interact with a device, to perform testing?

By design, Buildbot masters require a Buildbot slave to perform any job. For example, if we have a Windows slave for creating Windows builds, we would expect to run a Buildbot slave on the Windows machine, and this would then be assigned tasks from the Buildbot master, which it would perform, and feed results back to the Buildbot master.

In the mobile device world, this is a problem:

  1. Running a slave process on the device would consume precious limited resources
  2. Buildbot does not run on phones, or mobile boards

Thus was born …. the foopy.

What the hell is a foopy?

A foopy is a machine, running CentOS 6.2, that is devoted to the task of interfacing with pandas or tegras, and running buildbot slaves on their behalf.

My first mistake was thinking that a “foopy” is a special piece of hardware. This is not the case: it is nothing more than a regular CentOS 6.2 machine – just a regular server, with no special physical connection to the mobile device boards – that has been set aside for this purpose, and has network access to the devices, just like other machines on the same network.

For each device that a foopy is responsible for, it runs a dedicated buildbot slave. Typically each foopy serves between 10 and 15 devices, which means it will have around 10-15 buildbot slaves running on it in parallel (assuming all the devices are running OK).

When a Buildbot master assigns a job to a Buildbot slave running on the foopy, the slave runs the job, but parts of the job involve communicating with the device: pushing binaries onto it, running tests, and gathering results. As far as the Buildbot master is concerned, the slave is the foopy, and the foopy is doing all the work – it doesn’t need to know that the foopy is executing code on a tegra or panda. As far as the device is concerned, it is receiving tasks over the SUT Agent listener network interface, and performing those tasks.

So does the foopy always connect to the same devices?

Yes. Each foopy has a static list of devices for it to manage jobs for.

How do you see which devices a foopy manages?

If you ssh onto the foopy, you will see the devices it manages as subdirectories under /builds:

pmoore@fred:~/git/tools/sut_tools master $ ssh foopy106
Last login: Mon Apr 28 22:01:18 2014 from 10.22.248.82
Unauthorized access prohibited
[pmoore@foopy106.p10.releng.scl1.mozilla.com ~]$ find /builds -maxdepth 1 -type d -name 'tegra-*' -o -name 'panda-*'
/builds/panda-0078
/builds/panda-0066
/builds/panda-0064
/builds/panda-0071
/builds/panda-0072
/builds/panda-0080
/builds/panda-0070
/builds/panda-0074
/builds/panda-0062
/builds/panda-0063
/builds/panda-0067
/builds/panda-0073
/builds/panda-0076
/builds/panda-0075
/builds/panda-0079
/builds/panda-0077
/builds/panda-0068
/builds/panda-0061
/builds/panda-0065
[pmoore@foopy106.p10.releng.scl1.mozilla.com ~]$

How did those directories get created?

Manually. Each directory contains artefacts related to that panda or tegra, such as log files for verify checks, error flags if it is broken, disable flags if it has been disabled, etc. More about this later. Just know at this point that if you want that foopy to look after that device, you better create a directory for it.

So the directory existence on the foopy is useful to know which devices the foopy is responsible for, but how do you know which foopy manages an arbitrary device, without logging on to all foopies?

In the tools repository, the file buildfarm/mobile/devices.json also defines the mapping between foopy and device. Here is a sample:

{
 "tegra-010": {
 "foopy": "foopy109",
 "pdu": "pdu1.r602-11.tegra.releng.scl3.mozilla.com",
 "pduid": ".AA1"
 },
 "tegra-011": {
 "foopy": "foopy109",
 "pdu": "pdu2.r602-11.tegra.releng.scl3.mozilla.com",
 "pduid": ".AA1"
 },
 "tegra-012": {
 "foopy": "foopy109",
 "pdu": "pdu3.r602-11.tegra.releng.scl3.mozilla.com",
 "pduid": ".AA1"
 },
......
 "panda-0168": {
 "foopy": "foopy45",
 "relayhost": "panda-relay-014.p1.releng.scl1.mozilla.com",
 "relayid": "2:6"
 },
 "panda-0169": {
 "foopy": "foopy45",
 "relayhost": "panda-relay-014.p1.releng.scl1.mozilla.com",
 "relayid": "2:7"
 },
 "panda-0170": {
 "foopy": "foopy46",
 "relayhost": "panda-relay-015.p2.releng.scl1.mozilla.com",
 "relayid": "1:1"
 },
......
}

So what if the devices.json lists different foopy -> devices mappings than the foopy filesystems list? Isn’t there a danger this data gets out of sync?

Yes, there is nothing checking that these two data sources are equivalent. For example, if /builds/tegra-0123 was created on foopy39, but devices.json said tegra-0123 was assigned to foopy65, nothing would report this difference, and we would have non-deterministic behaviour.
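
Nothing like this exists today, but a consistency check would be straightforward to sketch in Python – run on a foopy, it compares the devices assigned to that foopy in devices.json with the directories actually present under /builds (the devices.json path is wherever your tools checkout lives):

import glob
import json
import os
import socket

def check_consistency(devices_json="buildfarm/mobile/devices.json"):
    # Short hostname of this foopy, e.g. "foopy106".
    foopy = socket.gethostname().split(".")[0]
    with open(devices_json) as f:
        devices = json.load(f)
    assigned = {name for name, info in devices.items() if info.get("foopy") == foopy}
    present = {os.path.basename(d)
               for d in glob.glob("/builds/tegra-*") + glob.glob("/builds/panda-*")}
    for name in sorted(assigned - present):
        print("%s: assigned in devices.json but has no /builds directory" % name)
    for name in sorted(present - assigned):
        print("%s: has a /builds directory but is not assigned to %s in devices.json" % (name, foopy))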

Why is the foopy data not in slavealloc?

Currently the fields for slaves are static across the different slave types – so if we added a “foopy” field for these device slaves, it would also appear for all other slave types, which have no foopy association.

What is that funny other data in the devices.json file?

The “pdu” and “pduid” are the coordinates required to determine the physical power supply of the tegra. These are the values you pass to the PDU API to enable/disable power for that particular tegra.

The “relayhost” and “relayid” are the equivalent values for the panda power supplies.

Where does this data come from?

This data is maintained in IT’s inventory database. We duplicate this information in this file.

Example: https://inventory.mozilla.org/en-US/systems/show/2706/

So is a PDU and a relay board essentially the same thing, just one is for pandas, and the other for tegras?

Yes.

What if we want to write comments in this file? JSON doesn’t support comments, right?

For example, you might want to add a comment to explain why a tegra is not assigned to a PDU. Since JSON does not support comments, we add a _comment field, e.g.:

 "tegra-024": {
 "_comment": "Bug 727345: Assigned to WebQA",
 "foopy": "None"
 },

Is there any sync process between inventory and devices.json to guarantee integrity of the relayboard and PDU data?

No. We do not sync the data, so there is a risk it can get out of sync. This could be solved by auto-syncing inventory data into the devices.json file, or by using inventory as the data source rather than the devices.json file.

So how do we interface with the PDUs / relay boards to hard reboot devices?

This is done using the sut_tools reboot.py script.
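
Very roughly, the script has to resolve the device’s power-control coordinates from devices.json and then hit the corresponding PDU or relay board API. A hedged sketch of that lookup (the two power_cycle helpers are placeholders for the real API calls, and my reading of relayid as “bank:relay” is an assumption):

import json

def pdu_power_cycle(pdu, pduid):
    # Placeholder: the real code calls the PDU's API to drop and restore power.
    print("would power cycle outlet %s on %s" % (pduid, pdu))

def relay_power_cycle(relayhost, bank, relay):
    # Placeholder: the real code talks to the relay board.
    print("would power cycle bank %s relay %s on %s" % (bank, relay, relayhost))

def power_cycle(device, devices_json="buildfarm/mobile/devices.json"):
    with open(devices_json) as f:
        info = json.load(f)[device]
    if "pdu" in info:                      # tegras hang off a PDU
        pdu_power_cycle(info["pdu"], info["pduid"])
    elif "relayhost" in info:              # pandas hang off a relay board
        bank, relay = info["relayid"].split(":")
        relay_power_cycle(info["relayhost"], bank, relay)
    else:
        raise ValueError("%s has no power-control coordinates" % device)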

Is there anything else useful in this “sut tools” folder?

Yes, lots. It provides scripts for all sorts of tasks: deploying artefacts on tegras and pandas, rebooting, running smoke tests and verifying devices, cleaning up devices, accessing device logs, etc.

Summary part 2

So we’ve learned:

  • Tegras and Pandas do not run buildbot slaves; we have dedicated machines, called foopies, that run buildbot slaves on their behalf
  • Foopies are regular CentOS 6.2 machines, with one buildbot slave running per device that they manage
  • Foopies typically manage 10-15 devices
  • The mapping of foopy -> devices is stored in the devices.json file in the tools project
  • This file is maintained by hand, but contains data that came from IT’s inventory database for PDU / relay boards
  • PDUs and relay boards are the devices that control the power supply to the tegras / pandas respectively
  • We can power cycle devices by using the reboot.py script in the sut_tools directory of the tools repository
  • There are other useful tools in “sut tools” folder for device tasks
  • Foopies are not in slavealloc


How we do automated mobile device testing at Mozilla – Part 1

Video of this presentation from Release Engineering work week in Portland, 29 April 2014

Part 1: Back to basics

What software do we produce for mobile phones?

  • Firefox for Android (Fennec)
  • Firefox OS (B2G)

What environments do we use for building and testing this software?

  • Fennec – built on CentOS 6.2 (bld-linux64-ix-* machines in-house, bld-linux64-ec2-* machines in AWS); tested on tegras, pandas and emulators
  • B2G – built on CentOS 6.2; tested on emulators

So first key point unveiled:

  • We don’t build on tegras and pandas (we only test!)

Second key point:

  • Fennec is the only product we test on tegras and pandas (we don’t test B2G on real devices)

So why do we test Fennec on tegras, pandas and emulators?

To answer this, first remember the wide variety of builds and tests we perform:

Screenshots from tbpl

The answer is:

  • We use tegras to test: Android 2.2 (Froyo)
  • We use pandas to test: Android 4.0 (Ice Cream Sandwich)
  • We use emulators to test: Android 2.3 (Gingerbread) and Android 4.2 (Jelly Bean)

Notice:

  • We don’t test on 3.x (Honeycomb)
  • We don’t test on 4.4 (KitKat)
  • The versions we test on emulators are not sequential (i.e. we test 2.3 and 4.2 on emulators, while 4.0, which sits between those two versions, is tested on pandas)

What are the main differences between our tegras and pandas?

Tegras:

  • Older
  • Running Android 2.2
  • Hanging in shoe racks
  • Can only be reimaged by physically connecting them to a laptop and pressing buttons in a magical sequence
  • Not very reliable
  • Connected to a “PDU”, which allows us to programmatically call an API to “pull the power”

Pandas:

  • Newer
  • Running Android 4.0
  • Racked professionally in Faraday cages
  • Can be remotely reimaged by mozpool (moar to come later)
  • Quite reliable
  • Connected to a “relay host”, which allows us to programmatically call an API to “pull the power”

So as you see, a panda is a more serious piece of kit than a tegra. Think of a tegra as a toy.

So what are tegras and pandas, actually?

Both are mobile device boards, as you see above, like you would get in a phone, but not actually in a phone.

So why don’t we just use real phones?

  1. Real phones use batteries
  2. Real phones have wireless network

Basically, by using the boards directly, we can:

  1. control the power supply (by connecting them to power units – PDUs) which we have API access to (i.e. we have an API to pull the power to a device)
  2. use ethernet rather than wireless (which is more reliable: no wireless signals interfering with each other, less radiation, …)

OK, so we have phones (or “phone circuit boards”) wired up to our network – but how do we communicate with them?

Fennec historically ran on more platforms than just Android. It also ran on:

  • Windows Mobile
  • the Nokia N900 Maemo device

For this reason, it was decided to create a generic interface, which would be implemented on all supported platforms. The SUT Agent was born.

Please note: nowadays, Fennec is only available for Android 2.2+. It is not available for iOS (iPhone, iPad, iPod Touch), Windows Phone, Windows RT, Bada, Symbian, Blackberry OS, webOS or other mobile operating systems.

Therefore, the original reason for creating a standard interface to all devices (the SUT Agent) no longer exists. It would also be possible to use a different mechanism (telnet, ssh, adb, …) to communicate with the device. However, this is not what we do.

So what is the SUT Agent, and what can it do?

The SUT Agent is a listener running on the tegra or panda, that can receive calls over its network interface, to tell it to perform tasks. You can think of it as something like an ssh daemon, in the sense that you can connect to it from a different machine, and issue commands.

How do you connect to it?

You simply telnet to the tegra or panda, on port 20700 or 20701.

Why two ports? Are they different?

Only marginally. The original idea was that users would connect on port 20701, and that automated systems would connect on port 20700. For this reason, if you connect on port 20700, you don’t get a prompt. If you connect on port 20701, you do. However, everything else is the same. You can issue commands to both listeners.

What commands does it support?

The most important command is “help”. It displays this output, showing all available commands:

pmoore@fred:~/git/tools/sut_tools master $ telnet panda-0149 20701
Trying 10.12.128.132...
Connected to panda-0149.p1.releng.scl1.mozilla.com.
Escape character is '^]'.
$>help
run [cmdline] - start program no wait
exec [env pairs] [cmdline] - start program no wait optionally pass env
 key=value pairs (comma separated)
execcwd <dir> [env pairs] [cmdline] - start program from specified directory
execsu [env pairs] [cmdline] - start program as privileged user
execcwdsu <dir> [env pairs] [cmdline] - start program from specified directory as privileged user
execext [su] [cwd=<dir>] [t=<timeout>] [env pairs] [cmdline] - start program with extended options
kill [program name] - kill program no path
killall - kill all processes started
ps - list of running processes
info - list of device info
 [os] - os version for device
 [id] - unique identifier for device
 [uptime] - uptime for device
 [uptimemillis] - uptime for device in milliseconds
 [sutuptimemillis] - uptime for SUT in milliseconds
 [systime] - current system time
 [screen] - width, height and bits per pixel for device
 [memory] - physical, free, available, storage memory
 for device
 [processes] - list of running processes see 'ps'
alrt [on/off] - start or stop sysalert behavior
disk [arg] - prints disk space info
cp file1 file2 - copy file1 to file2
time file - timestamp for file
hash file - generate hash for file
cd directory - change cwd
cat file - cat file
cwd - display cwd
mv file1 file2 - move file1 to file2
push filename - push file to device
rm file - delete file
rmdr directory - delete directory even if not empty
mkdr directory - create directory
dirw directory - tests whether the directory is writable
isdir directory - test whether the directory exists
chmod directory|file - change permissions of directory and contents (or file) to 777
stat processid - stat process
dead processid - print whether the process is alive or hung
mems - dump memory stats
ls - print directory
tmpd - print temp directory
ping [hostname/ipaddr] - ping a network device
unzp zipfile destdir - unzip the zipfile into the destination dir
zip zipfile src - zip the source file/dir into zipfile
rebt - reboot device
inst /path/filename.apk - install the referenced apk file
uninst packagename - uninstall the referenced package and reboot
uninstall packagename - uninstall the referenced package without a reboot
updt pkgname pkgfile - unpdate the referenced package
clok - the current device time expressed as the number of millisecs since epoch
settime date time - sets the device date and time
 (YYYY/MM/DD HH:MM:SS)
tzset timezone - sets the device timezone format is
 GMTxhh:mm x = +/- or a recognized Olsen string
tzget - returns the current timezone set on the device
rebt - reboot device
adb ip|usb - set adb to use tcp/ip on port 5555 or usb
activity - print package name of top (foreground) activity
quit - disconnect SUTAgent
exit - close SUTAgent
ver - SUTAgent version
help - you're reading it
$>quit
quit
$>Connection closed by foreign host.

Typically we use the SUT Agent to query the device, push Fennec and tests onto it, run tests, perform file system commands, execute system calls, and retrieve results and data from the device.
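
For example, here is a minimal Python helper that drives the interactive port (20701), using commands from the help output above and assuming the “$>” prompt shown in the session – note it ends the session with “quit”, for reasons explained below:

import telnetlib

def sut_command(host, cmd, port=20701):
    # Send one command to the SUT Agent and return its output.
    tn = telnetlib.Telnet(host, port)
    tn.read_until(b"$>")                  # wait for the interactive prompt
    tn.write(cmd.encode() + b"\n")
    output = tn.read_until(b"$>").rsplit(b"$>", 1)[0]
    tn.write(b"quit\n")                   # end the session -- do NOT send "exit" (see below)
    tn.close()
    return output.decode().strip()

print(sut_command("panda-0149", "info os"))
print(sut_command("panda-0149", "ver"))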

What is the difference between quit and exit commands?

I’m glad you asked. “quit” will terminate the session. “exit” will shut down the sut agent. You really don’t want to do this. Be very careful.

Is the SUT Agent a daemon? If it dies, will it respawn?

No, it isn’t, but yes, it will!

The SUT Agent can die, and sometimes does. However, it has a daddy, who watches over it. The Watcher is a daemon, also running on the pandas and tegras, that monitors the SUT Agent. If the SUT Agent dies, the Watcher will spawn a new SUT Agent.

Probably it would be possible to have the SUT Agent as an auto-respawning daemon – I’m not sure why it isn’t this way.

Who created the Watcher?

Legend has it, that the Watcher was created by Bob Moss.

Where is the source code for the SUT Agent and the Watcher?

The SUT Agent codebase lives in the firefox desktop source tree: http://hg.mozilla.org/mozilla-central/file/tip/build/mobile/sutagent

The Watcher code lives there too: http://hg.mozilla.org/mozilla-central/file/tip/build/mobile/sutagent/android/watcher

Do the Watcher and SUT Agent get automatically deployed when there are new changes?

No. If there are changes, they need to be manually built (no continuous integration) and manually deployed to all tegras, and a new image needs to be created for pandas in mozpool (will be explained later).

Fortunately, there are very rarely changes to either component.

Summary part 1

So we’ve learned:

  • Tegras and Pandas are used for testing Fennec for Android
  • They run different versions of the Android OS (2.2 vs 4.0)
  • We don’t build anything on them
  • Tegras are older/inferior/less reliable than pandas
  • We can’t reimage tegras programmatically, but pandas we can
  • There is a SUT Agent that runs on both the tegras and the pandas, and provides a mechanism for interacting with them
  • There is a Watcher that keeps the SUT Agent alive
  • Whenever a new version of SUT Agent or Watcher is required, this needs to be manually built and rolled out to devices
