NoSQL databases are not new, but they have definitely gained traction in the last few years. Designed with distribution and scalability in mind, NoSQL technologies lend themselves to filling the gaps where traditional RDBMSs fall short.
Traditional RDBMSs generally support vertical scaling: to improve performance of the database, the performance of the host machine must be increased – i.e. a faster CPU, faster hard disks etc. While these changes may be simple to implement for a small organisation, the associated costs rise steeply with the purchase of more powerful hardware and licenses. Vertical scaling is also not elastic: it is rarely worthwhile downgrading a server after a period of high load, and adding or upgrading hardware often requires planned downtime.
NoSQL databases offer the ability to scale horizontally. That is, database performance can be increased by adding additional machines to a cluster. Load and data can be distributed between the nodes, reducing the demand on any single machine.
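Distributing data between nodes usually means partitioning it by key. A minimal sketch of hash-based sharding (the node names are hypothetical, and real systems typically use consistent hashing rather than a plain modulo so that adding a node remaps only a fraction of the keys):

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical cluster members

def node_for_key(key: str) -> str:
    """Map a key to a node by hashing it, spreading load across the cluster."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return NODES[digest % len(NODES)]

# Every client computes the same mapping, so no central coordinator is needed.
placement = {k: node_for_key(k) for k in ["user:1", "user:2", "user:3"]}
```

Because the mapping is deterministic, any node (or client) can locate a key's owner without asking a master for directions.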
Both NoSQL databases and traditional RDBMSs support replication (data hosted on multiple nodes) and sharding (data split between nodes). So what makes NoSQL special? Relational database servers were typically built to optimise data consistency and availability, with distribution added as an afterthought, whereas NoSQL databases were built for distribution from the ground up.
This leads nicely onto the CAP theorem, coined by Eric Brewer in 2000: a distributed system cannot simultaneously provide all three of the following quality attributes:
- Consistency – A read sees all previously completed writes
- Availability – Reads and Writes always succeed
- Partition Tolerance – Guaranteed properties are maintained even when network failures prevent some machines from communicating with others
RDBMSs typically run on a single node (or a tightly coupled cluster), so they do not need to tolerate network partitions and can readily provide both consistency and availability. Coordinating across a potentially partitioned network is expensive, and requiring it for every read and write would add significant latency to requests. NoSQL databases are instead built on the assumption that the network will suffer partitions, and that partition tolerance is therefore required. Given this constraint, the database must favour either consistency or availability during a network partition. However, these choices are not strictly binary, and trade-offs can be made between consistency and availability.
In an AP database (availability and partition tolerance selected), when a network partition occurs, servers on both sides of the partition are still able to offer reads and writes over the data. Writes to the database are uncoordinated, meaning that the one database effectively acts as two disjoint databases until the network partition is resolved.
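A toy sketch of this behaviour (plain dicts standing in for replicas; this is not any specific database's replication protocol): during the partition each side accepts writes independently, so the same key can diverge and must be reconciled when the partition heals.

```python
# Two replicas of the same logical database, cut off from each other.
replica_us = {"stock": 5}
replica_eu = {"stock": 5}

# Uncoordinated writes on each side of the partition.
replica_us["stock"] = 4   # one sale recorded in the US
replica_eu["stock"] = 3   # two sales recorded in the EU

# When the partition heals, the replicas disagree: a conflict that must be
# resolved (last-writer-wins, vector clocks, an application-level merge...).
conflict = replica_us["stock"] != replica_eu["stock"]
```

Which resolution strategy is appropriate depends on the domain – the hotel example below shows one where a simple merge is acceptable.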
Choosing consistency does not mean that data is unavailable when the network is partitioned: reads may still be permitted. However, a strictly consistent database must ensure that the data on both sides of a partition agrees, meaning that writes may be denied.
So you might think: I want my data to be consistent, so why would I choose an available database?
Imagine a hotel booking website, hosted in two regions (US and EU), that suffers a network partition. With a consistent database, no customer could book a hotel until the partition is healed: an error would have to be presented stating that making a reservation is not possible at this time, and the business would lose revenue. With an available database, customers on both sides of the partition can still book rooms. However, there is a chance that the hotel is overbooked once the network is restored and the changes from both partitions are merged. This can be acceptable behaviour for the system: as hotels typically operate overbooking policies anyway, it reflects the real-world domain model and ensures that the hotel is always booked towards its capacity.
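The hotel scenario can be sketched in a few lines (the numbers and the additive merge rule are illustrative assumptions; the merge is exactly where the overbooking arises):

```python
capacity = 10
rooms_free = 2   # state replicated to both regions before the partition

# During the partition, each region sells rooms against its local copy.
booked_us = 2
booked_eu = 1

# After the partition heals, bookings from both sides are merged additively.
total_booked = (capacity - rooms_free) + booked_us + booked_eu
overbooked_by = max(0, total_booked - capacity)   # one room over capacity
```

Both regions kept taking money during the outage; the business absorbs the occasional overbooking, just as it would with a deliberate overbooking policy.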
Several types of NoSQL store exist (ordered by model simplicity and scalability, simplest and most scalable first):
- Key-Value Stores (e.g. Project Voldemort): a large, distributed, persistent hash table. High availability at the cost of convenience: it is not possible to conduct joins inside the database. A common usage pattern is as an application cache.
- Column Stores (e.g. HBase and Cassandra): data is stored in column families rather than rows, and is typically sparse.
- Document Stores (e.g. MongoDB and OrientDB): documents are sets of key-value pairs, but can have a more complex, nested structure. These databases better support aggregate/structured information.
- Graph Databases (e.g. Neo4j): data is stored as an object graph. It is easier to traverse relationships and explore aggregates local to an object, but harder to compute aggregate results over the entire database.
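The key-value "application cache" usage pattern mentioned above looks roughly like this (a sketch with a plain dict standing in for a networked store such as Voldemort; `load_user_from_rdbms` is a hypothetical fallback to the system of record):

```python
cache = {}  # stands in for a networked key-value store

def load_user_from_rdbms(user_id: int) -> dict:
    """Hypothetical expensive lookup against the relational system of record."""
    return {"id": user_id, "name": f"user-{user_id}"}

def get_user(user_id: int) -> dict:
    key = f"user:{user_id}"
    if key not in cache:              # cache miss: fall through to the RDBMS
        cache[key] = load_user_from_rdbms(user_id)
    return cache[key]                 # cache hit: no database round trip
```

The store only ever sees opaque gets and puts by key – which is why joins are impossible, and also why this model distributes so well.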
One of the advantages of NoSQL is that there is no longer an impedance mismatch when developing applications: a DAL or ORM does not need to be generated to map a schema onto object code. NoSQL offers development agility by allowing object types to be created or modified without migrating all the existing objects stored in the database. However, there are still challenges: with no supervised schema, developers become responsible for managing the different versions of objects that accumulate.
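One common way to manage those versions is to carry a version field in each stored document and upgrade old shapes on read (a "migrate on read" sketch; the field names and the v1-to-v2 change are illustrative assumptions):

```python
def upgrade(doc: dict) -> dict:
    """Bring a stored document up to the current schema version."""
    if doc.get("version", 1) == 1:
        # v1 stored a single "name"; v2 splits it into first/last.
        first, _, last = doc.pop("name", "").partition(" ")
        doc.update({"first": first, "last": last, "version": 2})
    return doc

old = {"name": "Ada Lovelace", "version": 1}
new = upgrade(old)   # now in the v2 shape, with first/last split out
```

The database never needs a migration run: documents are upgraded lazily as the application touches them, while already-current documents pass through unchanged.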
Although the new NoSQL technologies support replication and are more scalable than relational databases, they are not a be-all-and-end-all solution to replace them. Relational databases offer a more structured model that is easier to query, and as relational technology is very mature, there is strong consistency between the most common offerings such as Oracle, Microsoft SQL Server and MySQL.
One may wish to migrate to NoSQL to increase development agility, or to better handle larger loads or volumes of data. Migrating an existing application wholesale to one of these technologies can be a costly venture, so the costs, benefits and trade-offs must be weighed. It may be more suitable to augment an application with one of these technologies to improve a high-load part of it, choosing whichever technology best meets our needs.