Author: Steve Jackson

Why A Locust Swarm Is A Good Thing

Recently, I gave a talk on how “A Locust Swarm Can be a Good Thing!” at Stir Trek in Columbus.  The talk covered our experience of load testing and preparing for hundreds of thousands of users on the first day of the .Realtor site launch.

This was a challenging environment with lots of external dependencies of varying capacity and failure tolerance.  We knew from the beginning that we’d never be able to load test all of our dependencies at once, so we had to figure out ways to test in isolation without spending all our time rewriting our test infrastructure.  Additionally, our load testing started late enough that we would not be able to coordinate an independent test with all of our dependencies in time.  We also needed to figure out what our users would do without having ever observed user behavior on the real site.  We built a model user funnel to capture our expectations and continually tweaked it as we discovered new wrinkles.  This funnel formed the basis for our load test script and allowed us to prioritize our integration concerns.

In the end, we learned a lot about making our workflows asynchronous, Linux kernel optimization, decoding performance metrics, and building giant DDoS clouds of load test slaves.  We also learned that load testing should start “earlier”.  Conversations about load and user behavior drive new requirements, and testing can uncover fundamental infrastructure problems.  Decoding and isolating performance problems can require a lot of guessing and experimentation, things that are difficult to do thoroughly with an immovable launch date.  It’s also difficult to make large-scale changes to an application with confidence under time pressure.  One of my key takeaways is to be nicer to external partners.  The point of load testing is to find the breaking points of a system, and most people don’t like it when their toys get broken. Building trust and safety into that relationship is very important before trying to figure out where and how something went wrong.

Check out the slides to the original talk here.

Stir Trek is an excellent conference with extremely thoughtful organizers and friendly people.  20 people came to talk to me in person after my talk, which was great!  Tickets sell out very quickly, but I recommend getting in next year if you can!

IoT Course Week 11: Load Testing


Last week, we dove into the importance of incorporating and collecting analytics through your connected device, looked at how that information helps provide business value, and played with some of the ways it can be displayed using some pretty graphs.

This Week

This week, we’ll continue our focus on non-functional requirements and start load testing. With connected devices, if the device can’t call home to its shared services, it loses a lot of its value as a smart device. These services need to be highly reliable, but things get interesting when thousands or millions of devices decide to call home at the same time.

To load test, we’ll generate concurrent usage on the system until a limit, bottleneck, unexpected behavior, or issue is discovered. This usage should model real-life usage as closely as possible, so the analytics we put in place last week will be a valuable resource. In instances where we don’t have data to work with, we can build out user funnels and extrapolate based on anticipated usage. Bad things will happen if we ship thousands of products without any idea how our system will react under the load. This data will also be a useful baseline for capacity planning and system optimization experiments.
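
As a back-of-the-envelope illustration, every number below is made up rather than taken from the course, but the extrapolation itself is the useful habit:

# Hypothetical inputs for sizing a load test before any real analytics exist.
devices_shipped = 10000       # assumption: size of the first production run
check_in_interval_s = 60      # assumption: each device phones home once a minute
peak_factor = 3               # assumption: headroom for synchronized spikes

steady_rps = devices_shipped / float(check_in_interval_s)
target_rps = steady_rps * peak_factor
print("Load test target: ~%.0f requests/sec (steady state ~%.0f)" % (target_rps, steady_rps))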

The Lampi system has two shared services that we need to put under load. One is the Django web server that handles login, and the other is the MQTT broker that handles sending messages to the lamp.

Load Testing with Locust

To test the web server we use Locust. Locust has become a LeanDog favorite due to its simple design, scalability, extensibility, and scriptability. We’ve used it to generate loads of 200,000 simultaneous users distributed across the US, Singapore, Ireland, and Brazil. These simulated users (locusts) walked through multi-page workflows at varying probabilities, modeling the end-to-end user interaction, complete with locusts dropping out of the user funnel at known decision points.

Locusts are controlled via a locustfile.py. The one below shows a user logging in and going to the home page:

from locust import HttpLocust, TaskSet, task

class UserBehavior(TaskSet):

    def on_start(self):
        # Each simulated user logs in before doing anything else
        self.login()

    def login(self):
        # Fetch the login page to get Django's CSRF token, then post the credentials
        response = self.client.get("/accounts/login/?next=/")
        csrftoken = response.cookies.get('csrftoken', '')
        self.client.post("/accounts/login/?next=/", {
            "csrfmiddlewaretoken": csrftoken,
            "username": {{USERNAME}},   # placeholders; substitute real test credentials
            "password": {{PASSWORD}},
        })

    @task(1)
    def load_page(self):
        self.client.get("/")

class WebsiteUser(HttpLocust):
    task_set = UserBehavior
    min_wait = 5000   # simulated users wait 5-9 seconds between tasks
    max_wait = 9000
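
The weight argument to the @task decorator is what lets a locustfile model a funnel like the one described above. A rough sketch (the paths and weights here are hypothetical, not taken from the .Realtor tests) might add something like this to the locustfile:

class FunnelBehavior(TaskSet):
    # Hypothetical weights: out of every 14 task runs, 10 hit the home page,
    # 3 run a search, and only 1 reaches the signup page, modeling users
    # dropping out of the funnel at each decision point.

    @task(10)
    def browse(self):
        self.client.get("/")

    @task(3)
    def search(self):
        self.client.get("/search/?q=lamps")

    @task(1)
    def sign_up(self):
        self.client.get("/accounts/signup/")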

In order to run Locust, we’ll need a machine outside of the system to simulate a number of devices. Locust is a Python package, so it can run on most operating systems. It uses a master/slave architecture, so you can distribute the simulated users across multiple machines and generate ever larger loads.

Once you install Locust and start the process, you control the test via a web interface.
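
For example, a distributed run might look something like this (assuming the pre-1.0 Locust command line; the master’s web interface listens on port 8089 by default, and [web_server] and [master_ip] are placeholders):

loadtest$ pip install locustio
loadtest$ locust -f locustfile.py --host=http://[web_server] --master
loadtest2$ locust -f locustfile.py --host=http://[web_server] --slave --master-host=[master_ip]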


Locust will aggregate the requests to a particular endpoint and provide statistics and errors for those requests.  


Load Testing with Malaria

To test MQTT, we use a fork of Malaria, a tool designed to exercise MQTT brokers. Like Locust, Malaria spawns a number of processes to publish MQTT messages. Unlike Locust, it’s not easy to script; you have to fork it to do parametric testing or randomize data.

usage: malaria publish [-D DEVICE_ID] [-H HOST] [-p PORT] [-n MSG_COUNT] [-P PROCESSES]

Publish a stream of messages and capture statistics on their timing

optional arguments:
  -D DEVICE_ID    Set the device id of the publisher
  -H HOST         MQTT host to connect to (default: localhost)
  -p PORT         Port for remote MQTT host (default: 1883)
  -n MSG_COUNT    How many messages to send (default: 10)
  -P PROCESSES    How many separate processes to spin up (default: 1)

By modulating MSG_COUNT and PROCESSES you can control the load being sent to the broker.

Running Some Example Loads

Small load: Using 1 process, send 10 messages from device id [device_id]

loadtest$ ./malaria publish -H [broker_ip] -n 10 -P 1 -D [device_id]

Produces results similar to this:

Clientid: Aggregate stats (simple avg) for 1 processes
Message success rate: 100.00% (10/10 messages)
Message timing mean 344.51 ms
Message timing stddev 2.18 ms
Message timing min 340.89 ms
Message timing max 347.84 ms
Messages per second 4.99
Total time 14.04 secs

Large load: Using 8 processes, send 10,000 messages each from device id [device_id]

loadtest$ ./malaria publish -H 192.168.0.42 -n 10000 -P 8 -D [device_id]

Monitoring The Broker

The MQTT broker publishes a special set of $SYS topics that allow you to monitor it.

This command will show all the monitoring topics (note that the $ is escaped with a backslash):

cloud$ mosquitto_sub -v -t \$SYS/#

The sub-topics under $SYS/broker/load/ are of particular interest.
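
For example, to watch just the load statistics (topic names here are Mosquitto’s; other brokers may differ):

cloud$ mosquitto_sub -v -t \$SYS/broker/load/#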

Gather data

Before we start testing, we should figure out what metrics we want to measure. Resources on the shared system (CPU, memory, bandwidth, file handles) are good candidates for detecting capacity issues. Focusing on the user experience (failure rate, response time, latency) will help you home in on the issues that will incur support costs or retention problems. Building the infrastructure to gather, analyze, and visualize those metrics can be a significant part of the load testing process, but those tools are also necessary to do useful operational support in production. For the class, students used sysstat, Locust, the MQTT $SYS topics, and Malaria to gather metrics. A production-like system might use AWS CloudWatch, New Relic, Nagios, Cacti, Munin, or a combination of other excellent tools.
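
As a rough illustration of the gathering side (a sketch using the third-party psutil package rather than the class tooling), a small script can sample CPU and memory alongside a test run:

import csv
import time

import psutil

# Sample CPU and memory every 5 seconds and append to a CSV for later graphing.
with open("capacity_metrics.csv", "a") as f:
    writer = csv.writer(f)
    writer.writerow(["timestamp", "cpu_percent", "memory_percent"])
    while True:
        cpu = psutil.cpu_percent(interval=5)   # blocks for the 5-second sample window
        mem = psutil.virtual_memory().percent
        writer.writerow([int(time.time()), cpu, mem])
        f.flush()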

The point of load testing is to find the limits and then decide what to do about them. There will be a point where the cost to rectify an issue is greater than any immediate benefit; load testing will help you find that bar. During the class, limits of around 1,000 simultaneous users for the web server and 5,000-10,000 MQTT messages per process were common.

Final project

For their final project, two students from the class, Matthew Bentley and Andrew Mason, decided to take on some of the problems with mqtt-malaria and extend Locust to publish MQTT messages. Using Locust, they were able to scale their load test infrastructure across many machines and put a broker under more stress. In their previous testing with Malaria, they found the point where a single device could send no more messages (at a reasonable rate), but they could not scale Malaria to determine at what point the broker would stop processing additional connected devices’ messages. Through their efforts, they reached 100% CPU on the broker, pushing 1 million messages a minute to 4,000 users. As a result of their work, they also open-sourced their contribution to Locust.
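
Their open-sourced contribution is the place to look for the real implementation, but the general shape of a custom Locust client for MQTT looks something like the sketch below (assuming the pre-1.0 Locust events API and the paho-mqtt package; the topic and payload are made up):

import time

import paho.mqtt.client as mqtt
from locust import Locust, TaskSet, task, events

class MQTTLocust(Locust):
    def __init__(self, *args, **kwargs):
        super(MQTTLocust, self).__init__(*args, **kwargs)
        # One MQTT connection per simulated device (broker address comes from --host)
        self.client = mqtt.Client()
        self.client.connect(self.host, 1883)
        self.client.loop_start()

class DeviceBehavior(TaskSet):
    @task
    def publish_state(self):
        start = time.time()
        try:
            info = self.locust.client.publish("lamp/set_config", '{"on": true}', qos=1)
            info.wait_for_publish()
        except Exception as e:
            events.request_failure.fire(request_type="mqtt", name="publish",
                                        response_time=int((time.time() - start) * 1000),
                                        exception=e)
        else:
            # Report timing back to Locust so the web UI aggregates MQTT publishes
            # just like HTTP requests
            events.request_success.fire(request_type="mqtt", name="publish",
                                        response_time=int((time.time() - start) * 1000),
                                        response_length=0)

class Device(MQTTLocust):
    task_set = DeviceBehavior
    min_wait = 1000
    max_wait = 5000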

IoT Course Week 10: Analytics


Last week we got our feet wet with an introduction to Bluetooth Low-Energy on iOS. This week, we’ll dive into analytics, provide business value, and make some pretty graphs.

Why Analytics?

When building a new product, there are always a variety of options on the table with which to improve that product. At LeanDog, we practice a software development cycle that includes short sprints coupled with an open and honest feedback loop that provides us with the information we need to make informed decisions about where to focus our efforts and resources. This allows us to make sure that we are building the right thing the first time and minimize the amount of risk inherent in the process.

Until relatively recently, collecting feedback about a product in-use was a long process that required either direct observation or careful reading of written user reviews and complaints. Due to the complex and inconsistent nature of users, collecting strong quantitative data about a product experience can be difficult. In a now infamous incident from 2013, a New York Times journalist wrote a negative review of the Tesla Model S, only to have the car’s onboard analytics refute many of his claims. It is not uncommon for a customer to report one thing, but end up doing something entirely different, and your user experience process will need to account for these inconsistencies. One of the many ways we solve that problem is through the use of analytics platforms and reporting tools.

In addition to uncovering potential pitfalls, analytics are a powerful way for product owners, designers, and developers to understand how a product is actually used. For companies that make physical devices, this provides insights that are difficult to collect otherwise. Imagine receiving a coupon in the mail for a smart GE light bulb you love that’s nearing the end of its lifetime. The only way GE could possibly anticipate that your current bulb is about to go out (without calling you every day to ask how often you turned it on in the last 24 hours) is through analytics. With analytics, you get an avenue outside of sales to start to figure out which features and products your users actually love, which have problems or aren’t worth further development, and even identify disengaged users for retention campaigns.

Enter Keen IO
For this class, we will use a popular analytics platform called Keen IO. Keen is a general purpose tool, not locked into web, mobile, or embedded specifics. It has a large number of supported software development kits (SDKs), including Ruby, iOS, Python, .NET, etc. It also offers a powerful free tier, which is perfect for the amount of traffic currently being driven on students’ LAMPi systems. Registering and sending a notification in Python is as simple as this:

from keen.client import KeenClient

client = KeenClient(
    project_id="xxxx",
    write_key="yyyy",
)

client.add_event("sign_ups", {
    "username": "lloyd",
    "referred_by": "harry"
})

This will send an event containing the signup data to Keen’s database. Now back at LAMPi headquarters we can track those signups on a giant web dashboard:

var series = new Keen.Query("count", {
  eventCollection: "sign_ups",
  timeframe: "previous_7_days",
  interval: "daily"
});

client.draw(series, document.getElementById("signups"), {
  chartType: "linechart",
  label: "Sign Ups",
  title: "Sign Ups By Day"
});


Keen also provides a number of ways to pull the analytics data back out and do additional processing to get exactly the view we want, like building a tree of who our top referrers are and what their “network” looks like.
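
For instance, the raw events can be pulled back out and re-aggregated in Python. This is a sketch that assumes the keen package’s extraction API and a read key; the field names match the sign_ups events above:

from collections import Counter

import keen

keen.project_id = "xxxx"
keen.read_key = "zzzz"   # a read key, separate from the write key used on the device

# Pull the raw sign_up events and count referrals per referrer.
events = keen.extraction("sign_ups", timeframe="previous_30_days")
referrals = Counter(e["referred_by"] for e in events if e.get("referred_by"))
print(referrals.most_common(10))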


What’s next?
Analytics can also provide a leading indicator to help model the number of users that will be pounding on your infrastructure. To learn more about how to address that issue, join us next week when we talk about load testing!

Making It Right, Making It Easy

 

In a recent essay, Kent Beck details two contrasting methods for introducing change in software:

1)  Make it run, make it right, make it fast.

2)  Make the change easy, then make the easy change.

The former is generally how we are taught to execute TDD: do the simplest (even stupid) thing to get the test green, then refactor with your safety net in place.

As systems grow, the second technique becomes useful.  When you can’t grok all the code in front of you, work to break it down until you can attack the problem at hand.

The key element of these two techniques is to suspend thinking about the end goal and make ourselves temporarily uncomfortable to simplify the problem domain.  Either technique creates the momentum one needs to get to a solution, rather than ending up in “analysis paralysis” or thrashing between several contradictory goals.

As someone with legacy code experience, I usually reach for the second technique before the first, and have often felt guilty about it.  First, when carrying out an extensive refactoring, there can be a long time where I fumble through the change.  That’s time when I’m not “adding business value” and “getting things done”.  Additionally, I’m abandoning my trusted feedback loop of TDD.  The tests will continue to pass (I’m refactoring safely), but I don’t know that this “make it easy” step is actually leading me in the direction I need to go.  At this point in my career, I’ve performed several hundred pointless refactorings that didn’t end in the desired result or advance my knowledge of the system in a meaningful way.

In some cases, it’s a bad idea to “make it easy” first. In others, there’s so much technical debt standing in the way that “make it run” is at best a terrible, fragile hack.  And once it runs, it’s pretty tempting to mark it as done and pick up the next card, rather than taking the three weeks it will likely take to “make it right”.

I like the tone that Kent sets, where there is a time and place for each technique.  Thinking tactically, “make it run, make it right” is the safe decision.  Of the two methods, only the first has any value if it’s 50% complete.  I’d caution anyone considering “making it easy” to be wary of the time that might take and set appropriate timeboxes.  Technical Debt will eventually need to be paid down to make progress, but today might not be that day.  Of course, like any decision for short-term gain, it could be pretty painful if you don’t follow through and “make it right” before moving on.