GoGuardian co-founder and CTO Advait Shinde discusses how we use Amazon Aurora to improve scalability by decreasing replica lag and bypassing limitations inherent in enterprise-grade relational databases like AWS RDS.
Advait: I'm the co-founder and CTO of this company called GoGuardian, we're in El Segundo. My ten year old cousin uses a Chromebook every day of his entire life and he does this in school. He doesn't have any paper on him and all of his exams and all the essays that he writes and all the worksheets that he does are through a computer. When we were in school, we didn't have that. We're seeing this as a major trend in education. Children are learning with laptops every day. GoGuardian is a company that really facilitates that and makes that actually possible by not only collecting data about what students are actually doing but protecting them from the bad parts of the Internet. That's what we do.
The general theme of this talk is insane growth with an insane amount of resource constraints from an engineering perspective. This is January 2014 if you guys can see and the right is roughly present day. As you can see we grow ... It looks linear here but the points where it's flat are really just summer. It's very, very scary growth and I think we hit the one million student mark with three engineers. Lots of scalability problems. From an engineering perspective, what we actually do is we collect data about how ... What students are actually doing on their computers. They watch a YouTube video, we collect metadata about what the video is, what the title is, what the YouTube video ID is and in aggregate we pull all of these events in and show this information back to teachers and school administrators with the goal of improving learning.
As we hit the million student mark…we had engineered a solution that worked great for 10,000 students or 100,000 students but we were having lots of difficulties at the million student mark. The Amazon Aurora announcement happened right around here and we were absolutely thrilled because it was a free 5X gain in scalability and we couldn't really rearchitect our data storage layer or our data pipeline with three engineers because we were still building features as a startup. We were absolutely thrilled, and let me tell you how it ended up working out. Before I continue, if you guys have any questions, please stop me in the middle, I'd love to make this like a back and forth thing. Yeah.
[Audience] What were you using before Amazon Aurora?
Advait: MySQL. It was a PHP MySQL monolithic app. We were moving really, really fast in shipping features and the whole MySQL compatibility thing was a very, very holy grail. A solid question. I think the first thing that you guys are wondering is whether the performance boost is real. It is. They're not just making up benchmark figures and let me show you why. This is a graph that we have in one of our inserting applications, like database insert applications. The red line is actually the 95th percentile insert latency and it's because we're doing bulk inserts. The blue line down here which is essentially the X-axis is the median latency. You could see we deployed Aurora right here. You could see the 95th percentile just drop and it's like 1.88 milliseconds per record. Obviously when you're measuring medians, you don't really see an appreciable difference. I think the Aurora team talks a lot about throughput but we really, really noticed it on the extreme ends of inserts… Here's another application we have. This is just a sample API endpoint and we have again, the 95th percentile in red, you can see the median jumping up a little bit to maybe one or two milliseconds but it just drops. This is like the deploy again. Absolutely phenomenal for us from that perspective.
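To make the median versus 95th percentile point concrete, here is a minimal sketch in Python. The latency samples are invented for illustration and are not GoGuardian's measurements; the helper name median_and_p95 is hypothetical, and the percentile uses the simple nearest-rank definition.

```python
import statistics


def median_and_p95(latencies_ms):
    """Return (median, nearest-rank 95th percentile) for a list of latencies."""
    ordered = sorted(latencies_ms)
    median = statistics.median(ordered)
    # Nearest rank: the smallest sample with at least 95% of samples at or below it.
    # Integer arithmetic is used to compute ceil(0.95 * n) without float rounding.
    rank = max(1, (95 * len(ordered) + 99) // 100)
    return median, ordered[rank - 1]


# Hypothetical per-record insert latencies: mostly fast, with a slow tail.
before_deploy = [1.0] * 90 + [40.0] * 10
after_deploy = [1.0] * 90 + [2.0] * 10

print(median_and_p95(before_deploy))
print(median_and_p95(after_deploy))
```

In this toy data the median is identical before and after, while the 95th percentile falls sharply, which is the shape of improvement described in the talk: the gain shows up in the tail, not in the typical request.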
Another thing is you guys should use read replicas. I think the Pinterest engineering team actually published a blog post earlier this year about how they manage to scale to I don't know how many hundreds of millions of users that they have. They've managed to do it with MySQL sharding and it's a very, very, very good blog post. You should definitely check it out if you use MySQL or want to shard MySQL, but one of the points that they bring up is that with regular MySQL, slave or read replica lag is very much a problem and we ... It results in issues like this so sorry for the crude drawing but here you have a vertical line representing time. That's the beginning of time and this is the end of time. Eventually it will propagate to the read replica but if you perform a read of foo, any time between when you make the write and when the propagation happens, you get an inconsistent read.
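The inconsistent read in that drawing can be sketched as a toy simulation. This is not how MySQL or Aurora replication is implemented; the class LaggingReplica and its methods are hypothetical, and time is passed in explicitly so the behavior is deterministic.

```python
class LaggingReplica:
    """Toy primary/replica pair: writes reach the replica after a fixed delay."""

    def __init__(self, lag_seconds):
        self.lag_seconds = lag_seconds
        self.primary = {}
        self.replica = {}
        self.pending = []  # (apply_at, key, value) tuples not yet replicated

    def write(self, key, value, now):
        # The primary sees the write immediately; the replica sees it later.
        self.primary[key] = value
        self.pending.append((now + self.lag_seconds, key, value))

    def read_replica(self, key, now):
        # Apply every pending write whose replication delay has elapsed.
        still_pending = []
        for apply_at, k, v in self.pending:
            if apply_at <= now:
                self.replica[k] = v
            else:
                still_pending.append((apply_at, k, v))
        self.pending = still_pending
        return self.replica.get(key)


# A replica that is 700 seconds behind still has no value one second after the write.
slow = LaggingReplica(lag_seconds=700)
slow.write("foo", "bar", now=0)
stale = slow.read_replica("foo", now=1)

# A replica that is 20 milliseconds behind has caught up by then.
fast = LaggingReplica(lag_seconds=0.02)
fast.write("foo", "bar", now=0)
fresh = fast.read_replica("foo", now=1)

print(stale, fresh)
```

The shorter the lag, the smaller the window in which a read of foo can miss the write, which is why replica lag measured in milliseconds rather than minutes changes what you can safely read from a replica.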
This is a huge class of errors. We actually noticed this when we were reading off the MySQL replica, and with Aurora, we're actually seeing less than ten milliseconds. I took this screenshot earlier this morning: like, two, three, four milliseconds of replication delay. For transactional writes, for example if you're writing a bank balance account, maybe this is a problem but for non-transactional writes, for example if you're logging YouTube metadata, then this is less of a problem. Here's a spike that we have of, this is vanilla MySQL and it's a replica lag that we have on RDS proper. MySQL on RDS and the Y-axis here is actually in seconds so it spikes up to well over 700 which is almost like 12 minutes of replica lag which is absolutely absurd from an application perspective.
We actually saw it as a real graph in production. Our customers saw this as well. Here's a snapshot of the maximum replica lag that we're seeing for all four of these read replicas here. Now the Y-axis is not seconds, it's milliseconds. It doesn't even go above 20 milliseconds over the course of ... I think this is three days over here. These are real graphs in real prod. No benchmarks.
I'd like to rebut the Pinterest engineering team: if you're using Aurora, you should think about using read replicas because the replica lag is really a game changer compared to regular MySQL. How many of you understand what an IOP is? How many of you understand what an IOP is in the perspective of MySQL select statements when you're an application developer? I think it's a very, very hard problem. When you go to provision a regular vanilla MySQL RDS instance, you get a little screen like this where you have to allocate some amount of storage and then you've got to provision some amount of IOPs and you're an application developer, you're writing the select and insert statements, and you don't really understand how that translates to IOPs. I don't think anybody actually does. All you know is how many IOPs you're consuming for a real application and then you just ratchet the number up 1,000 and yeah, you get a little bit better performance but it's such an opaque metric, I'm really, really happy that the Aurora team decided to forgo it and instead, it's just $0.02 per million requests.
Here's some actual real cost numbers for Oregon, the storage rate as well as the IO rate as well as the instance rate that you pay. Yeah, that was a huge one. Another huge one for us was really fast failovers and restarts. This isn't really like a ... I guess you guys do talk about availability but we've seen failovers happen and failovers do happen, it's silly to architect a system without the expectation of something failing over. Here's a real application graph, here you have 7:12 and 30 seconds and then I think this is about where the failure happens. The Y-axis here is latency in milliseconds so you see our API endpoint really, really suffering because of it, but literally, one minute later, it is fully subsided and I have a feeling that this is so wide because of actual DNS propagation and not necessarily because Aurora is failing over. I'm definitely going to check out that DBD Driver because I think that'll even lessen this gap. Yes, failovers do happen and in the MySQL world, whenever we failed over, it would be minutes, sometimes like tens of minutes before things actually came back online and it's pretty awesome that you can fail over or restart within less than a minute.
Final point. This is less related to Aurora but we really learned this the hard way and I think if you take one thing away from this talk, it's this: don't write to your MySQL or whatever database directly. You should buffer your writes with Kinesis or Kafka. This was our original, monolithic PHP MySQL application. This is how a lot of startups get off the ground and this is where we really proved our business value. It started to break in this situation. So over here, you have the Y-axis as incoming write throughput in the form of HTTP requests. Here we're ingesting data so as students use their Chromebooks, we're collecting data about what websites they're visiting, et cetera. We don't really have any control over what's happening here so it's just this big blue roller coaster.
As time goes on you have a real practical maximum of what your database write throughput is. It's a finite number and a lot of startups don't really hit it or companies in general don't really hit it but it's really, really important that you know what that ceiling is because when it hits, how is your application going to perform? If you are forced to ingest all of these writes, what does your application do? For us in this architecture, the answer was we just dropped the writes on the floor. That is obviously not okay. What we did is we pushed writes into Kinesis first and this provides a bunch of benefits which I'll get to in a second but for those of you guys who don't know, Kinesis and Apache Kafka are essentially distributed message queues or write-ahead logs almost where you can write stuff into them in a very, very high throughput way and you can have multiple readers that read off these streams and do stuff with it. In our case, we're pulling hundreds or thousands of messages off this Kinesis queue and inserting them into MySQL, or now I guess it's Aurora.
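The difference between writing directly and buffering first can be sketched with an in-memory queue standing in for Kinesis or Kafka. Everything here is an assumption for illustration: the function simulate, the per-tick load numbers, and the fixed database capacity are invented, and no real Kinesis or Kafka client is involved.

```python
from collections import deque


def simulate(incoming_per_tick, db_capacity_per_tick, buffered):
    """Replay a write load and return (written, dropped, backlog)."""
    buffer = deque()  # stand-in for the stream; appends are cheap
    written = dropped = 0
    for incoming in incoming_per_tick:
        if buffered:
            # Every incoming write is accepted into the buffer first.
            buffer.extend(range(incoming))
            # The consumer drains at most one database-sized batch per tick.
            batch = min(len(buffer), db_capacity_per_tick)
            for _ in range(batch):
                buffer.popleft()
            written += batch
        else:
            # Direct writes: anything above the database ceiling is lost.
            accepted = min(incoming, db_capacity_per_tick)
            written += accepted
            dropped += incoming - accepted
    return written, dropped, len(buffer)


# Hypothetical load with one spike above a database ceiling of 100 writes per tick.
load = [50, 200, 50, 50]
direct = simulate(load, db_capacity_per_tick=100, buffered=False)
with_buffer = simulate(load, db_capacity_per_tick=100, buffered=True)

print(direct)
print(with_buffer)
```

With direct writes the spike overflows the ceiling and those writes are dropped. With the buffer, the same spike becomes a temporary backlog that the consumer works off during the quieter ticks that follow, so nothing is lost as long as average load stays below the ceiling.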
The Kinesis and Kafka throughput is obviously naturally much, much higher than MySQL as well as Aurora because it's not doing any application logic with it, it's just buffering the logs. It's also very easy to add more data stores so this was a very unintended consequence of our architecture design. We didn't really expect the use case of adding more data stores. Instead, we were very, very reliant on MySQL and Aurora but we needed this buffer in the form of Kinesis and Kafka to deal with this dotted blue dot problem. As a result, as a side effect of this architecture, we are able to just add more readers to our Kinesis stream and insert into other data stores so now we're branching off into using S3 as raw data storage as well as other data stores like Druid and it's all possible because we buffered our writes.
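The "just add more readers" idea can be sketched as an append-only log with independent reader offsets. This is a toy model, not the Kinesis or Kafka API: the Stream and Reader classes are hypothetical, and the Python lists stand in for data stores such as Aurora, S3, or Druid.

```python
class Stream:
    """Append-only log: records are never removed when a reader consumes them."""

    def __init__(self):
        self.records = []

    def put(self, record):
        self.records.append(record)


class Reader:
    """Reads the stream from its own offset and hands each record to a sink."""

    def __init__(self, stream, sink):
        self.stream = stream
        self.sink = sink  # any callable that stores one record
        self.offset = 0

    def poll(self, max_records=100):
        batch = self.stream.records[self.offset:self.offset + max_records]
        for record in batch:
            self.sink(record)
        self.offset += len(batch)
        return len(batch)


stream = Stream()
for video_id in ["a1", "b2", "c3"]:  # made-up event payloads
    stream.put({"event": "youtube_view", "video_id": video_id})

relational_rows = []   # stand-in for the relational database
raw_archive = []       # stand-in for raw object storage

db_reader = Reader(stream, relational_rows.append)
archive_reader = Reader(stream, raw_archive.append)
db_reader.poll()
archive_reader.poll()

# A store added later gets its own reader and replays from the start of the log.
analytics_rows = []
Reader(stream, analytics_rows.append).poll()

print(len(relational_rows), len(raw_archive), len(analytics_rows))
```

Because each reader tracks its own position, adding a new data store does not touch the ingest path or the existing readers.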
That's really all I have to say. My name is Advait. If you enjoyed this stuff, come find me afterwards. I have a whole bunch of GoGuardian stuff here so if all you want is GoGuardian stuff, you can come say hi and I'll give you some. Yeah, thanks for listening. I appreciate it.