Our mission is to make web business personal. We want our customers to be able to have the same delightful, personal experiences talking to their customers as they do talking to their friends.
Enabling all these conversations and interactions between businesses and their customers requires an extremely robust storage system that can scale as the data set expands – our storage needs, after all, have to keep pace not just with our own growth, but with the combined growth of all our customers.
In our system, we use the term “users” to describe the people who interact with our customers via Intercom, and needless to say, that means storing a very extensive live list of users.
Managing this user storage is the very foundation of the Intercom platform, and has required constant and rapid evolution since we started the company. Recently, we completed an ambitious transition to an entirely new user storage system. The story of how our user storage has evolved from our earliest days to now shines a light on how Intercom itself has evolved and scaled as a platform and as a company over the past eight years.
A brief history of user storage
2011 – MySQL
In the beginning we used a single Amazon RDS (relational database) instance for everything – including user data.
When we launched custom data for users, we opted to serialize it in a MySQL TEXT field.
This solution worked until we wanted to improve the user list by allowing customers to sort and filter on custom data, at which point we needed a more flexible solution.
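A minimal sketch of why serialized custom data is hard to sort and filter on. It assumes JSON as the serialization format and uses Python's built-in sqlite3 as a stand-in for MySQL; the table, columns, and helper names are hypothetical and not Intercom's actual schema. Because the database sees only an opaque string, any filter on a custom attribute has to read and deserialize every row in the application:

```python
import json
import sqlite3

# Assumption: sqlite3 stands in for MySQL; the schema below is illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, custom_data TEXT)"
)


def save_user(email, custom_data):
    """Store custom attributes as one serialized string in a TEXT column."""
    conn.execute(
        "INSERT INTO users (email, custom_data) VALUES (?, ?)",
        (email, json.dumps(custom_data)),
    )


def users_where_custom(key, predicate):
    """Filter on a custom attribute by scanning and deserializing every row."""
    matches = []
    for email, blob in conn.execute("SELECT email, custom_data FROM users"):
        data = json.loads(blob)
        if key in data and predicate(data[key]):
            matches.append(email)
    return matches


save_user("a@example.com", {"plan": "pro", "seats": 12})
save_user("b@example.com", {"plan": "free", "seats": 1})
save_user("c@example.com", {"plan": "pro", "seats": 3})

# The database cannot index or compare values inside the TEXT blob,
# so this query costs a full table scan however few rows match.
big_accounts = users_where_custom("seats", lambda n: n >= 3)
```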
2011-2014 – MongoDB
The solution we turned to was MongoDB, which was amazingly flexible for our purposes. It allowed us to easily store custom data and other complex data structures. We implemented a very flexible query pattern to power the user list. For our early years, this was more than adequate.
However, things that worked at low scale started to creak as we grew – we encountered problems powering the user list with MongoDB, and began to experience timeouts for some customers. It became a challenge for the database to serve a high volume of “single user reads/writes” as well as the search workload, and it became clear that we urgently needed to upgrade.
2014-2017 – MongoDB and Elasticsearch
In response, we started streaming user data out into Elasticsearch and used it to power the user list. Around the same time, MongoDB itself got a lot more stable.
That period of stability was not to last – in late 2016 MongoDB really started creaking for our purposes. We experienced regular outages (sometimes daily) and the constraints on large customers became untenable.
Internally, too, we were suffering – our infrastructure team spent nearly all their time keeping user storage alive, and it was not the sort of problem that we could solve by spending more with AWS.
Time to make a change
The situation was unsustainable and extremely stressful for the team – folks who were on call had to take their laptop with them everywhere, even to parties. That constant state of anxiety that something might break at any moment takes a serious toll, and it was clear something needed to change.
To support our new scale, we needed to fundamentally change our user storage layer.
The options appeared to be either:
Re-architecting our MongoDB setup to be truly horizontally scalable (which was made difficult by the initial setup that also accommodated search).
Achieving the same impact with another database.
Evaluating AWS Databases
Our philosophy of “Run Less Software” manifests in many ways, but one of the most far-reaching is that we don’t like running databases if we can help it. In order to see if there was an opportunity to shed that responsibility, we prototyped user storage on a couple of Amazon’s databases.
We looked at both Aurora and DynamoDB as possible candidates. Unfortunately, neither proved optimal:
Aurora could easily handle the uniqueness constraints that we needed to enforce on users, but it was clear that we would need to do a bunch of work to get it to support the level of reads and writes we needed.
DynamoDB could handle the R/W volume very easily (and affordably), but as a relatively simple key-value store, it couldn’t easily handle our uniqueness constraints.
The idea behind USV2
We realized, however, that it was possible to break this hard problem up into two smaller problems, one of which could be solved by Aurora and the other by DynamoDB.
We could use Aurora for “identity” – just the subset of data that needs a uniqueness constraint on it. This data changes rarely and is relatively small in terms of total size.
We could then use DynamoDB for key/value “blob” storage. This data changes a lot, and there is a lot of it – somewhere over 6TB, with about 5k writes a second and more than 10k reads a second. This volume is easy for DynamoDB.
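The split described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual USV2 implementation: sqlite3 stands in for Aurora, a plain dict stands in for DynamoDB, and the class names, the `create_user` helper, and the choice of `(app_id, email)` as the unique key are all hypothetical.

```python
import sqlite3


class IdentityStore:
    """Small relational store for the fields that need uniqueness (Aurora stand-in)."""

    def __init__(self):
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute(
            "CREATE TABLE identities ("
            " user_id INTEGER PRIMARY KEY AUTOINCREMENT,"
            " app_id TEXT NOT NULL,"
            " email TEXT NOT NULL,"
            " UNIQUE (app_id, email))"  # assumed constraint, for illustration
        )

    def create(self, app_id, email):
        """Insert an identity and return its id; reject duplicates."""
        try:
            cur = self.conn.execute(
                "INSERT INTO identities (app_id, email) VALUES (?, ?)",
                (app_id, email),
            )
        except sqlite3.IntegrityError:
            raise ValueError(f"{email} already exists in {app_id}") from None
        return cur.lastrowid

    def lookup(self, app_id, email):
        """Return the user id for an identity, or None if it does not exist."""
        row = self.conn.execute(
            "SELECT user_id FROM identities WHERE app_id = ? AND email = ?",
            (app_id, email),
        ).fetchone()
        return row[0] if row else None


class BlobStore:
    """Key/value store for the large, frequently changing data (DynamoDB stand-in)."""

    def __init__(self):
        self.items = {}

    def put(self, user_id, attributes):
        self.items[user_id] = dict(attributes)

    def get(self, user_id):
        return self.items.get(user_id)


def create_user(identity, blobs, app_id, email, attributes):
    """Enforce uniqueness in the identity store, then write the blob by key."""
    user_id = identity.create(app_id, email)
    blobs.put(user_id, attributes)
    return user_id


identity = IdentityStore()
blobs = BlobStore()
uid = create_user(identity, blobs, "app_1", "a@example.com", {"plan": "pro"})
```

The point of the design is that the relational store only ever sees the small, rarely changing identity rows, while every high-volume read and write becomes a lookup by key.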
Building in small steps
Another internal philosophy is that we like to ship big changes as a series of small, safe steps. It was of paramount importance that we were particularly deliberate and controlled when making large changes to such a core part of Intercom, which is far from a trivial task. This is how we approached it:
First we introduced a user service layer, and ensured that all access to user data went through this layer. This gave us a clear understanding of all of our access patterns, as well as the ability to make changes behind the service layer without other teams having to know about it.
Then we introduced the identity layer in Aurora, while still using MongoDB. This allowed us to simplify our access patterns on MongoDB to key/value – just like they would need to be on DynamoDB after migration.
We then migrated visitor data out of MongoDB and into DynamoDB, and finally moved user data into DynamoDB.
At every step along the way, we had periods where we were dual writing to both data stores. This meant that when things went wrong (as they inevitably did), we could seamlessly change back to the previous version, fix the issue and roll forward again.
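A rough sketch of how a service layer makes dual writing and rollback possible. Everything here is a simplified assumption rather than Intercom's actual code: `DictStore` is an in-memory stand-in for both MongoDB and DynamoDB, and `UserService` and its `read_from_new` flag are hypothetical names.

```python
class DictStore:
    """In-memory key/value stand-in for a real data store."""

    def __init__(self):
        self.items = {}

    def put(self, key, value):
        self.items[key] = dict(value)

    def get(self, key):
        return self.items.get(key)


class UserService:
    """Single entry point for user data; callers never touch the stores directly."""

    def __init__(self, old_store, new_store, read_from_new=False):
        self.old_store = old_store
        self.new_store = new_store
        self.read_from_new = read_from_new

    def save(self, user_id, attributes):
        # Dual write: both stores receive every change, so either one
        # can serve reads at any point during the migration.
        self.old_store.put(user_id, attributes)
        self.new_store.put(user_id, attributes)

    def load(self, user_id):
        # Reads come from exactly one store, chosen by a flag that can be flipped back.
        store = self.new_store if self.read_from_new else self.old_store
        return store.get(user_id)


service = UserService(old_store=DictStore(), new_store=DictStore())
service.save(42, {"name": "Ada"})

service.read_from_new = True    # roll forward to the new store
after_cutover = service.load(42)

service.read_from_new = False   # roll back if something goes wrong
after_rollback = service.load(42)
```

Because both stores hold the same data while dual writing is on, switching the read path in either direction does not change what callers see.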
Eventually, we were able to stop dual writing user data into MongoDB, and officially moved to DynamoDB only, the culmination of many years' worth of careful work maintaining and iterating on our user storage.
Counting the benefits
The impact made by our infrastructure team with this transition has been immense – our user storage is in a better place than it has ever been.
More scalable: We could 10x our customer count, and the number of users our customers have, and the rate at which each user is read or written, and DynamoDB would swallow that up. It autoscales, so assuming these changes happened gradually we wouldn’t have to do anything.
More secure: Our customers’ customer data is among the most precious data we have, and this change has made that critical data more secure than ever before.
Lower cost: The savings from this innovative solution are in the region of $500k a year.
Less maintenance: Not having to manually manage databases is huge, as this set-up is considerably less labor-intensive than the distributed databases we were previously managing. Significantly, we haven’t experienced outages at the level we had in the past since we made the move.
Less stress: The team can now go to parties without worrying they might need to battle to keep our user storage up and running. The time we save is significant, and allows us to spend much more energy and attention iterating on and evolving the user storage model, and more importantly, building new products and features for our customers.
Most impressively, we managed this migration of 3 billion users and 4.5 billion visitors with no downtime whatsoever. It’s the engineering equivalent of the proverbial “changing the car engine while still driving”. We’re extremely excited to see where this new engine takes us.