
Monitoring Applications with StatsD

Advait Shinde | Oct 7, 2015 2:02:37 PM

We run a lot of services and background workers at GoGuardian. StatsD is a dead-simple approach that enables us to instrument our applications with granular monitoring even before deploying to production. I gave a talk at Los Angeles's monthly JavaScript meetup JS.LA explaining how we use StatsD. In case you didn't make it to the talk, here is a video of the event:


 For your convenience, here is the video broken down per slide:

What is StatsD?:

Advait: Hey everyone, my name is Advait. I run the engineering team at GoGuardian. We build analytics and monitoring software for Chromebooks, primarily for schools.  I’m here to talk about StatsD! So, just, a show of hands real quick… how many of you guys use StatsD in production right now? How many of you guys have some sort of monitoring setup for your services and things like that? Okay, cool. So, for the rest of you, this is going to be awesome. So, what is StatsD?


How StatsD originated:

Advait:  The guys at Etsy in 2011 decided that they needed a way for engineers to be able to track applications for specific metrics. The idea was, back in the old way of thinking, when engineers developed an application, they packaged up the application binary and they kicked it over the wall to the ops guys. The ops guys would go and run it and then it would be the ops guys’ problem. Obviously, as many of you guys know, that trend is gone. Nowadays, you don’t even have ops guys anymore. For example, all of the engineers that push code at GoGuardian run the services.

So, how did ops guys make sure things were up? As an ops person you have metrics that are not very fine-grained to show what your applications are doing: CPU utilization, network IO, things like that. But, you’re really unable to crack into the app binaries and figure out what’s going on -- why are my requests taking so long? So, the guys at Etsy decided, “Hey, we need a way where developers who actually understand the internals of the application can outfit their application with metrics! We need a dashboard so that we can view these metrics and figure out what’s broken, why it’s broken, in a level that’s much more meaningful than just CPU utilization and network IO!”


The 4 Major Pieces of StatsD:

Advait:  StatsD consists of 4 pieces. You have the API client, which integrates with your actual application (originally the implementation was just in Node). Since then, StatsD has grown in popularity and there are API clients for literally every application backend: Python, Java, Scala, etc.  Anything you need, you will get a StatsD client for it. One of the guiding principles of the StatsD implementation was really just simplicity. They didn’t want to reinvent the wheel; they didn’t want to create this huge monitoring framework, etc. In turn, they decided to make it really, really simple. So, you have a UDP protocol, which your API clients wrap, and then on top of that, you’ve got this Daemon which runs, typically, on your actual application server.


The Data Flow of StatsD:

Advait:  You outfit your application with metrics, which will send UDP packets to a local Daemon. The local Daemon will then aggregate these packets together and send them up in batch to a backend. They kept StatsD as the very left hand side of this picture, and left the backend as its own thing. Again, don’t reinvent the wheel.


Why does StatsD use UDP?:

Advait:  So, a common question that people have when they look at StatsD is “Why UDP? That doesn’t make any sense.” I think it’s a great question! The primary reason is that you don’t want your metrics to impact your application in terms of performance. You don’t want to spend 20% of your cycles collecting and processing metric data. You want this to be as lightweight as possible. UDP is kind of fire and forget, although it doesn’t guarantee delivery. However, when you’re sending to localhost there really isn’t much of a problem, so it’s great.


Backends: How StatsD works with Datadog:

Advait:  So let’s talk about backends! Two slides ago, you remember, the backends were on the right hand side while StatsD was on the left. So, the original implementation started off with Graphite. Graphite is a really common graphing and metrics collection application. There are many backends these days, but the one I’m going to focus on today is Datadog.  Datadog is proprietary, and we really rely on it internally. It’s the greatest thing ever! So, I’m going to talk about it!  But, I know a lot of you guys love open-source, so, Graphite, Ganglia and several other open-source backends will be able to do all the metrics collection you want, for free.


More Backends to choose from:

Advait:  There are also more backends available! If you want to take raw metrics, pipe them into a database and do your own sort of aggregation/cool stuff on top, you can do that. There are quite a few more; these are just the top three that came to mind.

How to Setup Your Client Implementation:

Advait:  Let’s talk about actual implementation. Again, the guiding principle of StatsD is simplicity. With two lines of code you’ve got a StatsD client (I’m going to talk about JavaScript here obviously. But again, this is really client agnostic). There are similar idioms in literally every other language out there. So, you get a handle on the StatsD object, tell it to connect to localhost on UDP port 8125, and then you’re up, running and able to collect metrics.


Types of StatsD Metrics:

Advait:  So, let’s talk about the StatsD metrics. There are a few primitives that you can use, which will help you build your bigger picture.


StatsD Metrics: Using Counters as a Primitive:

Advait:  The first metric type is a counter. If you want to literally count something, you will use a counter. We use it to track pageviews, failed authentications, database requests, etc.  There are many different ways you can influence a counter. For example, you have standard increment and decrement operations, and then you have a count operation.

One thing I want to talk about here is the strings.  StatsD kind of requires you to associate a string with a metric you’re collecting. There are all sorts of conventions for how you should do that, but we use periods to delimit namespaces (which is the common industry convention). You would typically prefix auth.failure with the application that’s actually running the code and then maybe even add on the environment, whether it’s staging or production. You get a string, and that’s your counter, and you can increment it, or decrement it.
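As a rough sketch of what those counter operations put on the wire, here is the datagram format for increments and decrements; the panther.production prefix is illustrative, following the naming convention just described:

```javascript
// Sketch of counter datagrams. The name follows the period-delimited
// convention above; "panther" and "production" are illustrative
// prefixes, not actual code from the talk.
function counter(name, delta) {
  return name + ':' + delta + '|c'; // "c" marks a counter
}

// Prefix the metric with the app name and the environment:
const metric = ['panther', 'production', 'auth.failure'].join('.');

console.log(counter(metric, 1));  // increment -> panther.production.auth.failure:1|c
console.log(counter(metric, -1)); // decrement -> panther.production.auth.failure:-1|c
```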


StatsD Metrics: Using Counters as a Primitive, cont:

Advait:  This is what you get out of it!  A couple days ago we actually launched a production service and we did it in a phased rollout. We had direct control over how quickly we could send traffic to the service, or what percentage of our total traffic went through it. It was really important to me that this service did not go down, because it was really mission critical.  We did all of our load testing and due diligence, as we wanted to make sure it was functional. We went crazy with monitoring! Almost every aspect of it is outfitted with metrics. This is a simple panther.request. (Panther is the name of the service). So, just increment panther.request every time a request comes in. You can see the ramp up as we turned the knob all the way to 100. We reached a peak at about 97 requests/second. That’s the typical end of our really cyclic workload. We were able to see, as traffic was coming in, a nice simple, live, graph of what is going on. It’s so comforting when you can see this, as opposed to just CPU utilization. You’re able to more understand what is going on.


StatsD Metrics: Using Tags as a Primitive:

Advait:  Let’s talk about tags. Tags are a Datadog-specific thing. They’re not really available in Graphite, but you can fake it. I want to talk about them because they’ve really changed how we do monitoring. The concept of a tag is that you have two dimensions with StatsD: the string metrics that you’re collecting, and time. The goal is to figure out how your metrics change over time. Tags add another dimension on top of that. For example, when you send a metric, you can tag it with the host name, or you can tag it with the Amazon OpsWorks stack that it’s running on. Or, you can tag it with production -vs- staging. Then, you can filter what sort of tags you want to see, and also do cool stuff like this.

We’re running on Amazon and we’ve got a bunch of PHP servers that are set up with auto-scaling. This graph is throughout the work day, and shows how our workload increases over time. Every slice is a new application server coming in and adding to the load balancing pool. So, you can see how many requests/second each individual slice is taking, and how it piles up. The way we’ve done this is we’ve tagged every metric with the host name.  With these tags we can build cool lasagna graphs like this. This is really visual. You can see if a particular server is unable to serve the number of requests per second that you’re expecting, or see any kind of anomalies.

Question from crowd: Can you set locally the global prefixes in the tags?

Advait: Yes, absolutely! It really depends on the client library, but it’s really simple. Just prefixing the string. Just check the library, but, we recommend prefixing with your app name because your dashboard is eventually going to contain thousands and thousands of metrics and you want to be able to drill down by apps.


StatsD Metrics: Using Gauges as a Primitive:

Advait:  Gauges are the next metric primitive. The idea of a gauge is to measure the value of something at a particular time. The canonical example for gauges is measuring the fuel in a gas tank. We use them to keep track of how many connections we’re using in a database connection pool. This is a really useful metric because when services are being stressed, connection pool usage is a really good indicator of latency and when things are going to explode. With one line of code, any time a connection is grabbed from the pool or released back into the pool, you get this graph.
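A sketch of the grab/release pattern for the connection pool; db.pool.used is an illustrative name, and the g type marks a gauge (a point-in-time value rather than a count):

```javascript
// Gauge sketch for connection-pool usage. "db.pool.used" is an
// illustrative metric name; "g" tells the daemon this is a gauge.
function gauge(name, value) {
  return name + ':' + value + '|g';
}

let used = 0; // connections currently held out of the pool

function grab()    { used += 1; console.log(gauge('db.pool.used', used)); }
function release() { used -= 1; console.log(gauge('db.pool.used', used)); }

grab();    // db.pool.used:1|g
grab();    // db.pool.used:2|g
release(); // db.pool.used:1|g
```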

StatsD Metrics: Using Gauges as a Primitive, cont:

Advait:  Every bar here is a 5 minute sample. The y-axis represents the average number of connections that were held open in that 5 minute slot.  You can see we’re doing really well, we’re barely touching database connections.  When there is an outage or something serious going on, you’ll see the bars grow huge. Typically, they grow up to the limit of your connection pool.  It will be very easy to tell what is going on, compared to something like CPU utilization.

Question from the crowd: Would you report that on connection open and disconnect, or would you report that on an interval?

Advait: It really depends on what you want to do. We do it on every grab and release back into the pool, and that is what I would recommend.

StatsD Metrics: Using Histograms as a Primitive:

Advait:  Histograms are the next one and really my favorite. The gist of a histogram is that you have a distribution of values that you want to be able to measure (the most common being request latency). For example, I send an upstream request -- how long on average is it taking me to return?  Or, I have a downstream request that I have to process -- how quickly am I getting stuff out the door?  The way we’ve done that is that we’ve got a request that we’re sending to an upstream authentication server. Before we send the request we record a start time, and then after the request is complete we take a diff of the amount of time that has elapsed.
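The start-time/diff pattern looks roughly like this; auth.upstream.latency is an illustrative name, and ms is the classic StatsD timer type (Datadog also accepts an h histogram type):

```javascript
// Timing sketch following the start-time/diff pattern described above.
// "auth.upstream.latency" is an illustrative metric name.
function timing(name, ms) {
  return name + ':' + ms + '|ms';
}

const start = Date.now();
// ... the upstream authentication request would happen here ...
const elapsed = Date.now() - start;
console.log(timing('auth.upstream.latency', elapsed));
```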

StatsD Metrics: Using Histograms as a Primitive, cont:

Advait:  The cool part about StatsD is that it will automatically keep track of the mean, median, min, max and 95th percentile. When you’re tracking latency values, or any distribution of values, it’s not sufficient to keep track of the average. You really have to keep track of the 95th, or even the 99th percentile, because reporting just the average is deceiving. Here, in blue, we have the median and the 95th percentile of latency. As you can see, we’re doing quite well, clocking in at about 9 milliseconds for the median. This is as reported by the application, so it doesn’t know the actual internet latencies, only application-specific processing. But, that’s pretty solid in my mind. Again, the coolest part about this is when there is an outage going on and you’re looking for information about what’s really happening, this puts it in your face.  You can physically see: “My database latency for this specific database is going through the roof. My database connection pool is being maxed out for this service” and then you can respond to that. It’s just so much more intelligent than blind metrics.

StatsD Metrics: Using Sets as a Primitive:

Advait:  Sets just keep track of unique values. Here, I’m passing in the logged in user’s ID to track the number of unique logged-in users. So, every 5 minute interval we get the number of unique values/IDs that were seen.
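A set datagram just carries one value with the s type; the daemon counts the distinct values per flush interval. A sketch, with users.active as an illustrative metric name:

```javascript
// Set sketch. Each datagram carries one value with the "s" type; the
// daemon counts distinct values per flush interval. "users.active" is
// an illustrative metric name.
function uniqueValue(name, value) {
  return name + ':' + value + '|s';
}

console.log(uniqueValue('users.active', 1234)); // one logged-in user's ID
console.log(uniqueValue('users.active', 1234)); // duplicate: counted once
console.log(uniqueValue('users.active', 5678)); // a second unique value
```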

StatsD Metrics: Using Sets as a Primitive, cont:

Advait:  Unfortunately, there’s no magic behind this. All that is going on is that your StatsD Daemon is going to aggregate every unique value for a specific time bucket and send the count up. It is not going to send every unique user ID up to the backend, so you’re not going to get magical analytics. But, again, this is valuable information and is better than nothing.

StatsD Metrics: Using Sampling as a Primitive:

Advait:  The Heisenberg Uncertainty Principle says that you can’t outfit an application with metrics without affecting it. It’s true, right?  You’re doing some work to collect these metrics. In an ideal world it would be zero work, but in reality it’s not. This is where sampling comes in. If the number of requests you’re processing is through the roof, you’re spending a lot of time sending these UDP packets. It is lightweight, but it is still not zero. In this situation you don’t care about 7-8 significant figures of precision; you’re looking for a rough idea of what’s going on.  Therefore, you can use sampling to ease the burden. The idea behind sampling is that you probabilistically send a StatsD metric. For example, say we have a sampling rate of 10%. If Math.random() is less than 10%, the count is issued. But the thing is, because it only enters the “if statement” 10% of the time, you have to scale up the value that you’re incrementing. This is a lot to think about.  So, instead, StatsD simply puts it in the library. Most StatsD clients have the ability to sample, so you just pass in the sampling rate, forget about it, and it will probabilistically do everything under the covers, including the scaling up. It’s fantastic.
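A sketch of that probabilistic send, including the |@rate suffix that lets the daemon scale the count back up; panther.request is an illustrative name:

```javascript
// Sampling sketch: emit a datagram only sampleRate of the time, and
// include "|@<rate>" so the daemon can scale the count back up.
// "panther.request" is an illustrative metric name.
function sampledCounter(name, delta, sampleRate) {
  if (Math.random() < sampleRate) {
    return name + ':' + delta + '|c|@' + sampleRate;
  }
  return null; // dropped this time; the daemon compensates statistically
}

const datagram = sampledCounter('panther.request', 1, 0.1);
if (datagram !== null) {
  console.log(datagram); // e.g. panther.request:1|c|@0.1
}
```

In practice you would let the client library do this, as the talk recommends; the sketch just shows what it is doing under the covers.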


StatsD Metrics: Using Events as a Primitive:

Advait:  Events are another Datadog thing and are really cool. You have all these graphs that show really important things that happen throughout the day which significantly impact your application. For example, the deploys, autoscaling events, and things that happen which affected the system. You want to be able to see what happened at this time to cause a huge spike in traffic.
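Since events are a Datadog extension, they have their own DogStatsD datagram shape; a sketch, with a made-up deploy message:

```javascript
// DogStatsD (Datadog's daemon) extends the protocol with an event
// datagram of the form "_e{<title length>,<text length>}:<title>|<text>".
// This is Datadog-specific, not plain StatsD; the deploy text is made up.
function event(title, text) {
  return '_e{' + title.length + ',' + text.length + '}:' + title + '|' + text;
}

console.log(event('Deploy', 'panther v1.2 deployed'));
```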


StatsD Metrics: Using Events as a Primitive, cont:

 Advait:  So, here, we’ve got an example of how we did a deploy to our service which caused the requests per second to go to 0. The width of that gap right there makes me really sad and we’re gonna do a lot to shorten it. Luckily our clients are outfitted with back-off. So, as you can see, as soon as the application went back up, we have this huge spike and then it comes back down to normal. Don’t deploy your applications like this. Use a load balancer to take instances out of the pool before you deploy!  The point is, we did something, the graph changed, and we can see those things together. When things are breaking you have no idea how important this is.


Datadog Integrations:

Advait:  Datadog has many integrations; using StatsD isn’t the only way to monitor your stuff. You’ve got all these other things that you’re using, like event streams (GitHub deployments or issues that are opened), New Relic for monitoring, Chef for configuration. Then you’ve got other services too: MySQL, Cassandra, Solr, etc. All of these things are up and running and have their own metric systems. Datadog makes it really easy to consolidate this heterogeneous collection of metrics into one dashboard.


How StatsD and Datadog function with JMX:

Advait:  We did a test run a couple months ago with Cassandra. Basically, Datadog managed to pull JMX metrics. JMX is Java’s standard way of outfitting your application with monitoring, which definitely pre-dates StatsD. There is a lot of momentum behind it and the metrics it produces are great...but it’s Java specific. Datadog will integrate the information. You can see here we have Cassandra latency, with a MySQL read. MySQL is in dark blue, Cassandra is in purple, and the application time is in light blue. The Cassandra latency was pulled from JMX. The MySQL read latency was pulled from StatsD and you can see all these metrics coming in and presenting themselves on a single graph. It’s awesome, you should check it out.

Conclusion and Questions about StatsD:

Advait:  Yeah, that’s it! My name is Advait! If you like graphs, come join us. We’re doing some really cool stuff at GoGuardian and thank you!

Question from the crowd: Can you mix and match the clients in terms of you’re recording in the background what’s going on with PHP and maybe what’s going on in Node in the background, and front end Javascript which obviously isn’t local host?

Advait: So, front end JavaScript is tricky because you’ve got clients all over the place and you’ve got to funnel metrics into a Daemon that’s running somewhere. I wouldn’t recommend doing that. To answer your PHP + Node case, yeah, we have a bunch of services -- Java, PHP, Node -- and they all use StatsD. They all have different clients, but the Daemon that we run on the application servers is the same.

Question from the crowd: So the metrics are all going into the same backend?

Advait: Yeah, exactly. It’s great, especially if you have a service oriented infrastructure. You can really correlate how negative stuff in one service affects other services in one dashboard.

Question from the crowd: Was the Datadog choice, was it specifically because of those things you mentioned are Datadog specific?

Advait: Yeah, I really didn’t want to host my own Graphite server. Also, at that time we were really, really small. Like, 2 or 3 engineers. It just seemed like a really easy way to get up and running, and since then it has been great.

Question from the crowd: If you’re already using something like New Relic, that already does all your base level stats, what are the sort of interesting cases beyond the base level stats?

Advait: That’s a really great question. We use New Relic as well. My only gripe is that it does a lot of stuff. It will give you analysis of how you’re using SQL, and tell you, what are you doing man, add an index here. But, it doesn’t do everything. I want to track latency for a specific call that I’m doing, or, I’ve got this crazy offline computation I want to do -- how long is that taking?  Datadog really lets you get to that fine-tooth-comb level: I want to outfit THIS part with metrics. I think they work really well together, and the Datadog dashboard has the ability to pipe in New Relic information, like, all the graphs and all of the events and stuff. You should definitely check it out.


How are you monitoring your applications? Let us know in the comments below!

Written by Advait Shinde
