Facebook’s user DB — is it SQL or NoSQL?
During conversations offline, folks pointed out that my previous article talks about the problem gap that NoSQL should have solved for, but there is no clear definition of what NoSQL really is today. The definition has been stretched in various directions over time.
Ok, so what’s the big deal? How does the definition matter? Just tell me what Facebook does. Well, without knowing what we mean by “SQL” and “NoSQL”, it is difficult to classify a purpose-built infrastructure. And to make matters worse, folks first try to research if they should “go with a SQL or a NoSQL database” for their application, and then start researching “which of the NoSQL databases is ideal”. But different people mean different things when talking about NoSQL, so the choice of database ends up being non-ideal. This leads to massive pain in running the database in production and continuously evolving the application on top of it.
Thus, before answering the million-dollar question of what Facebook’s infrastructure is, let us set a reference point for classifying NoSQL databases by looking at some prevailing definitions.
Common criteria used to classify databases as SQL or NoSQL
1. Query language used by the database
One criterion for classification was based on the database API. For example, databases that had a structured query language and supported statements like “CREATE TABLE”, “SELECT/UPDATE/DELETE”, etc. were classified as SQL databases. On the other hand, NoSQL databases had a programmatic API and conformed to a different data model such as key-value, wide-column, graph or document stores.
However, over time, NoSQL has morphed into “Not Only SQL”. Databases such as Apache Cassandra (with the Cassandra Query Language, CQL) and Couchbase (with its N1QL query language, SQL for JSON) now support a structured query language.
This method of classification is therefore not very useful today.
2. Relational or non-relational data model
Databases that support relational algebra and have features such as foreign keys, joins and multi-row transactions are generally classified as SQL databases. On the other hand, NoSQL databases do not have these features. Secondary indexes are often a grey area in this definition. The NoSQL camp will claim they have global secondary indexes, but upon closer examination it generally turns out that the global secondary indexes are eventually consistent, and therefore not really usable in many applications.
3. Scale out for large datasets
SQL databases are not suited for large datasets, while NoSQL databases are ideal for rapidly growing data sets. For example, this site says:
If your business is not experiencing massive growth that would require more servers and you’re only working with data that’s consistent, then there may be no reason to use a system designed to support a variety of data types and high traffic volume.
NoSQL databases support large datasets by easily scaling out to accommodate more data.
4. High Availability (HA)
SQL databases are not highly available — the end users generally build HA on top by replicating data, monitoring failures and promoting slaves to masters. NoSQL databases, on the other hand, come with the HA built into the database itself. The data is automatically replicated, the database monitors for failures and promotes replicas to serve data.
It’s been about 4 years since I worked at Facebook, so things may have changed. But even the Facebook of 4 years ago had tremendous scale.
Here are the main points about Facebook’s user database:
- Consists of a very large number of sharded MySQL databases
- The application figures out which shard contains what piece of data and queries the correct database.
- The “application” referred to here is really more of an application server, which exposes a graph database with an integrated cache fronting the MySQL tier. The application server is called TAO.
- Each MySQL database (which is one shard of the entire data set) is replicated for redundancy purposes
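To make the "application figures out which shard" point concrete, here is a minimal sketch of application-side shard routing. It is not Facebook's actual code: the shard count, the host names and the helpers `shard_for` and `host_for` are all hypothetical, and a real setup would also have to deal with caching, replicas and shards that move.

```python
import hashlib

# Hypothetical values for illustration only; not Facebook's real configuration.
NUM_SHARDS = 8
SHARD_HOSTS = {
    shard: f"mysql-{shard:02d}.example.internal" for shard in range(NUM_SHARDS)
}


def shard_for(user_id: int) -> int:
    """Map a user id to a shard number using a stable hash.

    A stable hash (rather than Python's built-in hash()) ensures that every
    application server computes the same shard for the same user.
    """
    digest = hashlib.sha256(str(user_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


def host_for(user_id: int) -> str:
    """Return the MySQL host that holds this user's data."""
    return SHARD_HOSTS[shard_for(user_id)]
```

The application would call something like `host_for(user_id)` before every query, which is exactly the kind of logic a database with built-in sharding hides from you.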
Now let us apply the different criteria we listed above to see if this is a SQL or a NoSQL setup.
1. Query language = incorrect classification
We have already talked about why this is not a good criterion for classifying databases.
2. Non-relational data model = NoSQL
The application does not use relational features such as joins, foreign keys or multi-shard transactions. Global secondary indexes that span shards are not supported by the database.
3. Scale out = NoSQL
Facebook put a lot of intelligence into the application layer as well as the operations side to make their setup a scale-out solution. This involved pre-sharding the data into multiple databases, placing multiple logical databases on each physical machine and enhancing the application to be smart about reading/writing data from the correct shard. Adding nodes to this setup was achieved using additional operational automation to move and rebalance data. The net effect is a scale-out database that behaves pretty much like a NoSQL database would.
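The pre-sharding idea can be sketched roughly as follows. This is an illustration under my own assumptions, not a description of Facebook's tooling: the number of logical shards, the host names and the functions `initial_placement` and `add_host` are hypothetical. The point is that the number of logical shards is fixed up front, so adding a machine only means moving whole logical databases, never re-hashing individual rows.

```python
from collections import defaultdict

# Fixed up front ("pre-sharding"); hypothetical value for illustration.
NUM_LOGICAL_SHARDS = 12


def initial_placement(hosts):
    """Spread the logical shards round-robin over the physical hosts."""
    return {shard: hosts[shard % len(hosts)] for shard in range(NUM_LOGICAL_SHARDS)}


def add_host(placement, new_host):
    """Return a new placement in which new_host takes over whole logical
    shards from the most loaded hosts until the load is roughly even."""
    placement = dict(placement)  # do not mutate the caller's mapping
    by_host = defaultdict(list)
    for shard, host in sorted(placement.items()):
        by_host[host].append(shard)
    by_host.setdefault(new_host, [])

    target = NUM_LOGICAL_SHARDS // len(by_host)
    while len(by_host[new_host]) < target:
        # Pick the host that currently carries the most logical shards.
        donor = max(
            (h for h in by_host if h != new_host),
            key=lambda h: len(by_host[h]),
        )
        shard = by_host[donor].pop()
        by_host[new_host].append(shard)
        placement[shard] = new_host
    return placement
```

For example, starting from `initial_placement(["db-a", "db-b", "db-c"])` and calling `add_host(placement, "db-d")` moves one logical shard from each existing host to the new one. In production the hard part is the step this sketch skips: copying the data and switching traffic without downtime.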
4. HA = NoSQL
Facebook builds and maintains a custom solution to achieve HA. Using this custom-built solution, which takes serious engineering investment, they can easily detect node failures and fail over to replica databases to achieve HA. The solution as a whole is aware of where the data is replicated in order to be able to serve data on failures.
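As a rough sketch of what "detect failures and fail over" means for a single shard, consider the following. It is a toy model under stated assumptions, not Facebook's implementation: the `Shard` class, the `fail_over` function and the `is_healthy` callback are hypothetical, and it ignores the genuinely hard problems such as replication lag and split-brain.

```python
from dataclasses import dataclass, field


@dataclass
class Shard:
    """One shard of the data set: a primary plus its replicas (hypothetical model)."""
    primary: str
    replicas: list = field(default_factory=list)


def fail_over(shard, is_healthy):
    """Promote the first healthy replica if the primary is down.

    is_healthy is a caller-supplied function (host -> bool), standing in for
    whatever monitoring the real system uses. Returns the host that is the
    primary after the check.
    """
    if is_healthy(shard.primary):
        return shard.primary
    for replica in shard.replicas:
        if is_healthy(replica):
            shard.replicas.remove(replica)
            # Keep the failed host in the replica list so it can rejoin
            # as a replica once it has been repaired.
            shard.replicas.append(shard.primary)
            shard.primary = replica
            return shard.primary
    raise RuntimeError("no healthy replica available for this shard")
```

With `Shard("db-a", ["db-b", "db-c"])` and a health check that reports `db-a` as down, `fail_over` promotes `db-b`. Multiply this by thousands of shards, add monitoring and data-safety checks, and you get a sense of the engineering investment involved.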
Moral of the story
While Facebook’s “sharded SQL” solution is built on top of MySQL, the solution in its totality behaves like a NoSQL database: it does not use relational features (for example, joins and cross-shard transactions), and it is a distributed data infrastructure built for scale-out and HA.
Facebook has managed to build “their” custom NoSQL infrastructure for serving mission critical data, but it takes very serious investment to do this. In principle, NoSQL databases are supposed to have solved a similar problem for everyone at large. They should already be better and simpler than a sharded SQL setup. But this could not be further from today’s reality, so what gives? Well, that is a topic in itself.