One of the first things you need to decide when building a new application or major feature is how you are going to store and process the data for it, which means picking the right database for the job. This is, of course, assuming you will need to store and retrieve data and your storage needs are non-trivial. In this post I will attempt to walk through the various options and decisions you must make while selecting a database.
What makes this decision difficult is that you have a lot of options and there is a lot of variety to those options. There are relational databases, NoSQL databases. Transactional, non-transactional. Row-oriented or column-oriented. Community supported or commercial. And there are many subtle differences between each of those options. Since there are really too many options to cover everything in detail, my goal here is to give you a better understanding of how to evaluate your needs and match them to the right database.
Since practically all my experience is with open source databases, I will stick to what I know and just talk about that. I apologize to all you folks out there who insist on spending money to make you feel better about your decisions (cough, Oracle. cough, Windows…). Also, I want to avoid holy wars over your personal favorite database. There isn’t one database for every application or situation. So try to keep an open mind and objectively evaluate all options.
In order to evaluate which database to use, you must first understand your application and how it uses data. Here is a short list of key usage characteristics to consider:
- Storage size — How much data will your application use and store?
- Read:Write ratio — Will your application be read-intensive, write-intensive, or fairly balanced?
- Concurrent users — How many concurrent users will access your application or do you need your application to support? i.e. how much traffic or database transactions per second do you expect or need to plan for?
- Data model — What is your data model and how is it used?
- Security, stability, and availability — How much downtime or data loss can you tolerate? How sensitive is the data you are storing (financial or personal information)? Are there service-level agreements (SLAs) you must maintain?
With that knowledge of your application, you can better evaluate your database options and how they apply to your situation. Here are the three top factors that will help you pick the right database:
- What databases are you or your team experienced with?
- How well does your data model fit into the databases you are considering? Do you need a more flexible model?
- Will you need to split your database across multiple servers?
There is a lot of weight in your experience. A hot new database is great, but if you already have a lot of experience with a particular database like MySQL then you need to have a good reason to try something new. You have to be prepared to not only develop your application for a new database but to support and maintain it once it is in production. A new hot database also hasn’t been around as long for bugs to be found and fixed. It therefore may not be as mature and stable as something more traditional like MySQL or PostgreSQL. So you may run into more bugs, crashes and maybe even data loss. Even though there are ways to handle and prepare for failures, you have to understand and prepare for working with something unknown.
A challenge too, when deciding between relational databases or NoSQL databases, is that the APIs for NoSQL databases are very different than for relational databases. Relational databases are fairly universally supported through standard APIs like JDBC (Java Database Connectivity) for Java or DB-API for Python with standard interfaces for executing SQL and processing results. NoSQL databases, on the other hand, by definition do not use NoSQL and have database-specific APIs.
The data model is important when considering a database, especially when you get into the world of NoSQL databases. They typically have more flexible data models and sometimes fit better with your application. With a relational model, there is an impedance mismatch when trying to map it to classes in an object-oriented programming language. ORM (object-relational mapping) frameworks make this easier, but they tend to have a lot of performance overhead and some are difficult to setup and maintain. NoSQL databases, on the other hand, usually matche the data model in the application much more closely with the data model in the database. Between the flexibility of the data model and being able to better represent that in the application, NoSQL databases can be very good choices.
There is also another aspect of NoSQL databases which is important to consider. That is NoSQL databases tend to not be as feature rich as mature relational databases. Part of the reason is they are relatively new and the developers just haven’t had the time or resources to add certain features. Sometimes the reason is the feature is really hard or impractical to implement at the scale or in the distributed nature of the database. A few features that some NoSQL databases lack are:
- Transactions — This tends to be a feature NoSQL databases leave out.
- Indexes on fields other than the primary key — MongoDB has great index support, just as you would expect from a mature relational database, but databases that are key-value stores like Redis or modeled after Google BigTable like HBase are only queryable by their primary key.
- Foreign Keys — Also tends to be a feature NoSQL databases leave out. MongoDB has a convention for referencing other documents, but there is no constraint that is maintained by the database preventing broken foreign keys.
- Joins — Along the same line as foreign keys, joins are not typically supported. One way to get around this is to denormalize the data, embedding referenced data withing the referencing document. This doesn’t always work well, but in some cases can greatly improve database performance.
Distributing Across Multiple Servers
There are several ways to distribute your database across multiple servers, but they fall into one of two categories. The first is replication, where the entire database is replicated, or copied, and maintained consistently across every server in the cluster. Basic replication involves one master server capable of reading and writing data and several slave servers which are read-only. The slave servers are beneficial when your application has high read to write ratio, so that the majority of read operations can query from the slave servers and thus distribute the query load. Some database servers can run in a multi-master mode where more than one server can act as a master and be written to and read from at the same time. The limitation of replication is that the data set must fit in each server, so it is limited to the size of your smallest server. The benefit of replication is distributing the processing of queries across multiple servers.
The other main category of distributing across multiple servers is partitioning, where the data is split across the servers. This gives the benefit of distributing the data across the cluster so no one server contains all data for the application. Usually there are three copies of any given record, so if one database server fails, there are two copies available so there is no data loss. With replication schemes, the entire data set must fit within each server, so it is limited in how big the database can scale. Most NoSQL databases primarily use a form of partitioning (some, like MongoDB, call it sharding), while some offer both partitioning and replication (MongoDB employs both to provide the highest availability and reliability). This makes NoSQL databases very good at huge data sets.
One important thing to note about dealing with distributed database systems is that there are always tradeoffs made due to the many challenges in providing a consistent, available, and fault tolerant system. Many NoSQL databases, when fully distributed, follow an eventually consistent model, which means data written to one server in the cluster is not always immediately visible when read from another server in the cluster. Often times there are ways to ensure more consistency if you need it, but it is usually something you need to be aware of or configure when working with a distributed database.
Perhaps the biggest difference between various database options is to choose a relational database or a NoSQL database. Here is a quick rundown of the pros and cons:
|Well supported||Performance drops at very large data sets||Flexible data model||Not as mature|
|Well understood||Limited or challenging scalability||Highly scalable||Non-standard APIs|
|Fast and easy to setup||Cumbersome to code against||Easy to code against||Not as feature-rich|
Beyond the relational vs NoSQL decision, the choices get more subtle. In terms of relational database options, it is basically down to MySQL vs PostgreSQL. I prefer PostgreSQL, but they are very closely matched.
For a detailed review of the features, pros, and cons of many of the NoSQL databases out there check out this blog post by Kristof Kovacs. He does a much better job than I could do here. However, my personal favorite NoSQL database is MongoDB. I have used it for a very high volume application with huge amounts of data in a fully clustered environment and have been very happy with its performance.
I hope you have enjoyed this topic and that is able to help you pick the right database for your application.