How we do automated mobile device testing at Mozilla – Part 2

Video of this presentation from Release Engineering work week in Portland, 29 April 2014

Part 2: The foopy, Buildbot slaves, and SUT tools

So how does buildbot interact with a device, to perform testing?

By design, Buildbot masters require a Buildbot slave to perform any job. For example, to create Windows builds we would run a Buildbot slave on a Windows machine. The Buildbot master would assign tasks to that slave, which would perform them and feed the results back to the master.

In the mobile device world, this is a problem:

  1. Running a slave process on the device would consume precious limited resources
  2. Buildbot does not run on phones, or mobile boards

Thus was born … the foopy.

What the hell is a foopy?

A foopy is a machine, running CentOS 6.2, that is devoted to the task of interfacing with pandas or tegras, and running Buildbot slaves on their behalf.

My first mistake was thinking that a “foopy” is a special piece of hardware. This is not the case. It is nothing more than a regular CentOS 6.2 server, with no special physical connection to the mobile device boards. It is simply a machine that has been set aside for this purpose and that has network access to the devices, just like other machines in the same network.

For each device that a foopy is responsible for, it runs a dedicated buildbot slave. Typically each foopy serves between 10 and 15 devices. That means it will have around 10-15 buildbot slaves running on it, in parallel (assuming all devices are running ok).

When a Buildbot master assigns a job to a Buildbot slave running on the foopy, it will run the job inside its slave, but parts of the job will involve communicating with the device, pushing binaries onto it, running tests, and gathering results. As far as the Buildbot master is concerned, the slave is the foopy, and the foopy is doing all the work. It doesn’t need to know that the foopy is executing code on a tegra or panda. As far as the device is concerned, it is receiving tasks over the SUT Agent listener network interface, and performing those tasks.

So does the foopy always connect to the same devices?

Yes. Each foopy has a static list of devices for it to manage jobs for.

How do you see which devices a foopy manages?

If you ssh onto the foopy, you will see the devices it manages as subdirectories under /builds:

pmoore@fred:~/git/tools/sut_tools master $ ssh foopy106
Last login: Mon Apr 28 22:01:18 2014 from 10.22.248.82
Unauthorized access prohibited
[pmoore@foopy106.p10.releng.scl1.mozilla.com ~]$ find /builds -maxdepth 1 -type d \( -name 'tegra-*' -o -name 'panda-*' \)
/builds/panda-0078
/builds/panda-0066
/builds/panda-0064
/builds/panda-0071
/builds/panda-0072
/builds/panda-0080
/builds/panda-0070
/builds/panda-0074
/builds/panda-0062
/builds/panda-0063
/builds/panda-0067
/builds/panda-0073
/builds/panda-0076
/builds/panda-0075
/builds/panda-0079
/builds/panda-0077
/builds/panda-0068
/builds/panda-0061
/builds/panda-0065
[pmoore@foopy106.p10.releng.scl1.mozilla.com ~]$
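
The same listing can be produced from a script. The following is a minimal sketch, not part of sut_tools: the function name devices_on_foopy and the DEVICE_PATTERNS constant are made up for illustration. The only thing it takes from the text above is that device directories live directly under /builds and are named tegra-* or panda-*.

```python
import fnmatch
import os

# Naming convention taken from the listing above.
DEVICE_PATTERNS = ("tegra-*", "panda-*")


def devices_on_foopy(builds_dir="/builds"):
    """Return the sorted names of device directories directly under builds_dir.

    Hypothetical helper for illustration; it only looks at directory names.
    """
    names = []
    for entry in os.listdir(builds_dir):
        # Only directories count; a stray file named panda-xxxx is ignored.
        if not os.path.isdir(os.path.join(builds_dir, entry)):
            continue
        if any(fnmatch.fnmatch(entry, pattern) for pattern in DEVICE_PATTERNS):
            names.append(entry)
    return sorted(names)


if __name__ == "__main__":
    # Run on a foopy, this prints one device name per line.
    if os.path.isdir("/builds"):
        for name in devices_on_foopy("/builds"):
            print(name)
```

Unlike the find output above, the result is sorted, which makes it easier to compare against other sources.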

How did those directories get created?

Manually. Each directory contains artefacts related to that panda or tegra, such as log files for verify checks, error flags if it is broken, disable flags if it has been disabled, etc. More about this later. Just know at this point that if you want a foopy to look after a device, you had better create a directory for it.

So the directory existence on the foopy is useful to know which devices the foopy is responsible for, but how do you know which foopy manages an arbitrary device, without logging on to all foopies?

In the tools repository, the file buildfarm/mobile/devices.json also defines the mapping between foopy and device. Here is a sample:

{
    "tegra-010": {
        "foopy": "foopy109",
        "pdu": "pdu1.r602-11.tegra.releng.scl3.mozilla.com",
        "pduid": ".AA1"
    },
    "tegra-011": {
        "foopy": "foopy109",
        "pdu": "pdu2.r602-11.tegra.releng.scl3.mozilla.com",
        "pduid": ".AA1"
    },
    "tegra-012": {
        "foopy": "foopy109",
        "pdu": "pdu3.r602-11.tegra.releng.scl3.mozilla.com",
        "pduid": ".AA1"
    },
......
    "panda-0168": {
        "foopy": "foopy45",
        "relayhost": "panda-relay-014.p1.releng.scl1.mozilla.com",
        "relayid": "2:6"
    },
    "panda-0169": {
        "foopy": "foopy45",
        "relayhost": "panda-relay-014.p1.releng.scl1.mozilla.com",
        "relayid": "2:7"
    },
    "panda-0170": {
        "foopy": "foopy46",
        "relayhost": "panda-relay-015.p2.releng.scl1.mozilla.com",
        "relayid": "1:1"
    },
......
}
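
To answer "which foopy manages this device?" without logging on to anything, the file can be read with a few lines of Python. This is a sketch under assumptions: the function names foopy_for and devices_by_foopy are hypothetical, and the SAMPLE string is a small subset of the entries shown above rather than the real file.

```python
import json
from collections import defaultdict

# A subset of the sample above, inlined so the sketch is self-contained.
# In practice you would read buildfarm/mobile/devices.json from a tools checkout.
SAMPLE = """
{
    "tegra-010": {
        "foopy": "foopy109",
        "pdu": "pdu1.r602-11.tegra.releng.scl3.mozilla.com",
        "pduid": ".AA1"
    },
    "panda-0168": {
        "foopy": "foopy45",
        "relayhost": "panda-relay-014.p1.releng.scl1.mozilla.com",
        "relayid": "2:6"
    },
    "panda-0169": {
        "foopy": "foopy45",
        "relayhost": "panda-relay-014.p1.releng.scl1.mozilla.com",
        "relayid": "2:7"
    },
    "panda-0170": {
        "foopy": "foopy46",
        "relayhost": "panda-relay-015.p2.releng.scl1.mozilla.com",
        "relayid": "1:1"
    }
}
"""


def foopy_for(devices, device_name):
    """Return the foopy recorded for device_name, or None if the device is unknown."""
    return devices.get(device_name, {}).get("foopy")


def devices_by_foopy(devices):
    """Invert the mapping: foopy name -> sorted list of device names."""
    grouped = defaultdict(list)
    for name, info in sorted(devices.items()):
        grouped[info.get("foopy")].append(name)
    return dict(grouped)


if __name__ == "__main__":
    devices = json.loads(SAMPLE)
    print(foopy_for(devices, "panda-0170"))          # foopy46
    print(devices_by_foopy(devices)["foopy45"])      # ['panda-0168', 'panda-0169']
```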

So what if the devices.json lists different foopy -> devices mappings than the foopy filesystems list? Isn’t there a danger this data gets out of sync?

Yes. Nothing checks that these two data sources agree. For example, if /builds/tegra-0123 was created on foopy39, but devices.json said tegra-0123 was assigned to foopy65, nothing would report this difference, and we would have non-deterministic behaviour.
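
A check for this kind of drift would not need much code. The sketch below is hypothetical (no such check exists in the tools repository as far as this post describes). The function name find_mismatches and its result keys are made up for illustration. It compares the devices that devices.json assigns to one foopy against the directory names found under /builds on that foopy.

```python
def find_mismatches(devices, foopy_name, dirs_on_foopy):
    """Compare devices.json against the device directories found on one foopy.

    devices       -- parsed devices.json (device name -> dict with a "foopy" key)
    foopy_name    -- the foopy being checked, e.g. "foopy39"
    dirs_on_foopy -- device directory names found under /builds on that foopy
    """
    expected = {
        name for name, info in devices.items() if info.get("foopy") == foopy_name
    }
    actual = set(dirs_on_foopy)
    return {
        # devices.json says the foopy owns it, but no directory exists
        "missing_directory": sorted(expected - actual),
        # a directory exists, but devices.json assigns it elsewhere (or nowhere)
        "unexpected_directory": sorted(actual - expected),
    }


if __name__ == "__main__":
    # The scenario from the text: the directory is on foopy39,
    # but devices.json assigns the device to foopy65.
    devices = {"tegra-0123": {"foopy": "foopy65"}}
    print(find_mismatches(devices, "foopy39", ["tegra-0123"]))
    print(find_mismatches(devices, "foopy65", []))
```

Running it for every foopy would flag the tegra-0123 example twice: once as an unexpected directory on foopy39, and once as a missing directory on foopy65.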

Why is the foopy data not in slavealloc?

Currently the fields for the slaves are static across different slave types – so if we added a field for “foopy” for the foopies, it would also appear for all other slave types, which don’t have a foopy association.

What is that funny other data in the devices.json file?

The “pdu” and “pduid” fields identify the physical power supply of the tegra. These are the values that you call the PDU API with to enable/disable power for that particular tegra.

The “relayhost” and “relayid” are the equivalent values for the panda power supplies.
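
As an illustration of how a script might pick the right pair of fields for a device, here is a minimal sketch. It is not how reboot.py does it: the function name power_control and the ("pdu" / "relay", host, id) tuple shape are assumptions made for this example. It deliberately does not interpret the contents of "pduid" or "relayid".

```python
def power_control(entry):
    """Return (kind, host, identifier) for a devices.json entry, or None.

    kind is "pdu" for entries with pdu/pduid fields (tegras in the sample)
    and "relay" for entries with relayhost/relayid fields (pandas in the sample).
    """
    if "pdu" in entry and "pduid" in entry:
        return ("pdu", entry["pdu"], entry["pduid"])
    if "relayhost" in entry and "relayid" in entry:
        return ("relay", entry["relayhost"], entry["relayid"])
    # No power coordinates recorded for this device.
    return None


if __name__ == "__main__":
    tegra = {
        "foopy": "foopy109",
        "pdu": "pdu1.r602-11.tegra.releng.scl3.mozilla.com",
        "pduid": ".AA1",
    }
    panda = {
        "foopy": "foopy45",
        "relayhost": "panda-relay-014.p1.releng.scl1.mozilla.com",
        "relayid": "2:6",
    }
    print(power_control(tegra))
    print(power_control(panda))
```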

Where does this data come from?

This data is maintained in IT’s inventory database. We duplicate this information in this file.

Example: https://inventory.mozilla.org/en-US/systems/show/2706/

So are a PDU and a relay board essentially the same thing, just one is for tegras, and the other for pandas?

Yes.

What if we want to write comments in this file? JSON doesn’t support comments, right?

Suppose you want to add a comment explaining why a tegra is not assigned to a PDU. Since JSON does not support comments, we add a _comment field instead, e.g.:

    "tegra-024": {
        "_comment": "Bug 727345: Assigned to WebQA",
        "foopy": "None"
    },
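
One thing to watch when reading such entries from a script: in this example the foopy value is the string "None", not a JSON null, so a plain truthiness check would treat the device as assigned. The sketch below shows one way to handle that. The function names assigned_devices and device_comments are hypothetical, and the tegra-010 entry is included only for contrast.

```python
import json

SAMPLE = """
{
    "tegra-010": {
        "foopy": "foopy109"
    },
    "tegra-024": {
        "_comment": "Bug 727345: Assigned to WebQA",
        "foopy": "None"
    }
}
"""

# Values that mean "no foopy": a missing key, JSON null, or the string "None".
UNASSIGNED = (None, "None")


def assigned_devices(devices):
    """Return device name -> foopy, skipping devices without a real foopy."""
    return {
        name: info["foopy"]
        for name, info in devices.items()
        if info.get("foopy") not in UNASSIGNED
    }


def device_comments(devices):
    """Return device name -> _comment text for entries that carry a comment."""
    return {
        name: info["_comment"]
        for name, info in devices.items()
        if "_comment" in info
    }


if __name__ == "__main__":
    devices = json.loads(SAMPLE)
    print(assigned_devices(devices))   # {'tegra-010': 'foopy109'}
    print(device_comments(devices))    # {'tegra-024': 'Bug 727345: Assigned to WebQA'}
```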

Is there any sync process between inventory and devices.json to guarantee integrity of the relayboard and PDU data?

No. We do not sync the data, so there is a risk that our copy drifts out of sync with inventory. This could be solved by auto-syncing inventory into the devices.json file, or by using inventory as the data source instead of the devices.json file.

So how do we interface with the PDUs / relay boards to hard reboot devices?

This is done using the reboot.py script in sut_tools.

Is there anything else useful in this “sut tools” folder?

Yes, lots. It provides scripts for all sorts of tasks, like deploying artefacts on tegras and pandas, rebooting, running smoke tests and verifying the devices, cleaning up devices, accessing device logs, etc.

Summary part 2

So we’ve learned:

  • Tegras and pandas do not run Buildbot slaves; we have dedicated machines, called foopies, that run Buildbot slaves on their behalf
  • Foopies are regular CentOS 6.2 machines, with one Buildbot slave running per device that they manage
  • Foopies typically manage 10-15 devices
  • The mapping of foopy -> devices is stored in the devices.json file in the tools repository
  • This file is maintained by hand, but contains data that came from IT’s inventory database for PDUs / relay boards
  • PDUs and relay boards are the devices that control the power supply to the tegras / pandas respectively
  • We can power cycle devices by using the reboot.py script in the sut_tools directory of the tools repository
  • There are other useful tools in the “sut tools” folder for device tasks
  • The foopy data is not in slavealloc
