IoT Course Week 11: Load Testing

Screen Shot 2015-10-15 at 3.07.21 PM

Last week, we dove into into the importance of incorporating and collecting analytics through your connected device, how that information helps provide business value, and played with some of the ways that information can be displayed using some pretty graphs.

This Week

This week, we’ll continue our focus on non-functional requirements and start load testing. With connected devices, if the device can’t call home to its shared services, it loses a lot of its value as a smart device. These services need to be highly reliable, but things get interesting when thousands or millions of devices decide to call home at the same time.

To load test, we’ll generate concurrent usage on system until a limit, bottleneck, unexpected behavior, or issue is discovered. This usage should model real-life usage as close as possible, so the analytics we put in place last week will be a valuable resource. In instances where we don’t have data to work with, we can build out user funnels and extrapolate based on anticipated usage. Bad things will happen if we ship thousands of products without any idea how our system will react under the load. This data will also be a useful baseline for capacity planning and system optimization experiments.

The Lampi system has two shared services that we need to put under load. One is the Django web server that handles login, and the other is the MQTT broker that handles sending messages to the lamp.

Load Testing with Locust

To test the web server we use Locust. Locust has become a LeanDog favorite due to its simple design, scalability, extensibility, and scriptability. We’ve used it to generate loads of 200,000 simultaneous users distributed across the US, Singapore, Ireland, and Brazil. These simulated users (locusts) walked through multi-page workflows at varying probabilities, modeling the end-to-end user interaction, complete with locusts dropping out of the user funnel at known decision points.

Locusts are controlled via a locustfile.py. The one below shows a user logging in and going to the home page:

from locust import HttpLocust, TaskSet, task

class UserBehavior(TaskSet):

def on_start(self):
self.login()

def login(self):
response = self.client.get("/accounts/login/?next=/")
csrftoken = response.cookies.get('csrftoken', '')
self.client.post("/accounts/login/?next=/", {"csrfmiddlewaretoken": csrftoken, "username": {{USERNAME}}, "password": {{PASSWORD}} })

@task(1)
def load_page(self):
self.client.get("/")

class WebsiteUser(HttpLocust):
task_set = UserBehavior
min_wait = 5000
max_wait = 9000

In order to run locust, we’ll need a machine outside of the system to simulate a number of devices. Locust is a python package, so it can run on most OSes. It uses a master/slave architecture so you can distribute the simulated users and allow for more and more load.

Once you install locust and start the process, you control the test via a web interface.

Screen Shot 2016-05-04 at 12.09.46 PM

Locust will aggregate the requests to a particular endpoint and provide statistics and errors for those requests.  

Screen Shot 2016-05-04 at 12.09.55 PM

Screen Shot 2016-05-04 at 12.10.03 PM

Load Testing with Malaria

To test MQTT we used a fork of Malaria. Malaria was designed to exercise MQTT brokers. Like locust, Malaria spawns a number of processes to publish MQTT messages. Unlike locust, it’s not easy to script; you have to fork it to do parametric testing or randomize data.

usage: malaria publish [-D DEVICE_ID] [-H HOST] [-p PORT] [-n MSG_COUNT] [-P PROCESSES]

Publish a stream of messages and capture statistics on their timing

optional arguments:
-D DEVICE_ID (Set the device id of the publisher)
-H HOST (MQTT host to connect to (default: localhost))
-p PORT, (Port for remote MQTT host (default: 1883))
-n MSG_COUNT (How many messages to send (default: 10))
-P PROCESSES (How many separate processes to spin up (default: 1))

By modulating MSG_COUNT and PROCESSES you can control the load being sent to the broker.

Running some Example loads
Small load: Using 1 process, send 10 messages, from device id [device_id]

loadtest$ ./malaria publish -H [broker_ip] -n 10 -P 1 -D [device_id]

Produces results similar to this:

Clientid: Aggregate stats (simple avg) for 1 processes
Message success rate: 100.00% (10/10 messages)
Message timing mean 344.51 ms
Message timing stddev 2.18 ms
Message timing min 340.89 ms
Message timing max 347.84 ms
Messages per second 4.99
Total time 14.04 secs

Large load: Using 8 processes, send 10,000 messages each, from device id [device_id]

loadtest$ ./malaria publish -H 192.168.0.42 -n 10000 -P 8 -D [device_id]

Monitoring The Broker

MQTT provides a set of topics that allow you to monitor the broker.

This command will show all the monitoring topics (note that the $ is escaped with a backslash):

cloud$ mosquitto_sub -v -t \$SYS/#

The sub topics ‘…\load...’ are of particular interest.

Gather data

Before we start testing, we should figure out what metrics we want to measure. Resources on the shared system (CPU, memory, bandwidth, file handles) are good candidates for detecting capacity issues. Focusing on the user experience (failure rate, response time, latency) will help you hone in on the issues that will incur support costs or retention problems. Building the infrastructure to gather, analyze and visualize those metrics can be a significant part of the load testing process – but those tools are also necessary to do useful operational support in production. For the class, students used sysstat, locust, mqtt and malaria to gather metrics. A production-like system might use AWS Cloudwatch, New Relic, Nagios, Cacti, Munin, or a combination of other excellent tools.

The point of load testing is to find the limits and then decide what to do about them. There will be a point where the cost to rectify the issue is greater than any immediate benefit, load testing will help you find that bar. During the class, limits of a 1000 simultaneous users for web and 5,000-10,000 MQTT messages per process were common.

Final project

For their final project two students from the class, Matthew Bentley and Andrew Mason, decided to take on some of the problems with mqtt-malaria and extend Locust to publish MQTT messages. Using Locust they were able to scale their load test infrastructure across many machines and put a broker under more stress. In their previous testing with malaria, they found the point where a single device could send no more messages (at a reasonable rate), but they could not scale malaria to determine at what point the broker would not process any additional connected devices’ messages. Through their efforts, they reached 100% CPU on the broker, pushing 1 million messages a minute to 4000 users. As a result of their work they also open sourced their contribution to locust.