The NoSQL DB hype
Digg has decided to replace MySQL and most of its infrastructure components. It is probably following the example of Twitter, which moved away from LAMP (Linux + Apache + MySQL + PHP) to an architecture built around the NoSQL database Cassandra, a project open-sourced by Facebook in 2008 and licensed under the Apache License. Facebook uses Cassandra to power its inbox search. The Cassandra project is developing a highly scalable, second-generation distributed database. The reason for this move, as Digg explains it, is the increasing difficulty of building a high-performance, write-intensive application on a data set that is growing quickly, with no end in sight. This growth forced them into horizontal and vertical partitioning strategies that eliminated most of the value of a relational database while still incurring all of its overhead.
Some people joke that Cassandra's developers are celebrating that their database is now used to store the largest amount of worthless information in history. Others say that, before this decision, MySQL was the best way to store data no one cared about. Going beyond this criticism, I decided to question the hype around the NoSQL phenomenon and try to bust some myths surrounding relational vs. non-relational databases.
2010 is a very exciting time to be a database geek. Back in 2003, there were essentially seven different free choices, all of which were SQL-based. In 2010, there are dozens of new databases, with about 60 different flavors to choose from.
Actually, the "revolutionary" tag attached to all of these new types of databases is not accurate. Database algorithms have stayed almost the same since 2000, and the new crop of database systems consists of new implementations and combinations of earlier techniques. The new systems are not revolutionary, just evolutionary. The general belief is that the non-relational databases lumped together as NoSQL have “radically different” organizations and use cases. But that is true not only of the non-relational databases; it is just as true of the various relational databases.
Databases, relational or not, can be tiny, such as the embedded use of SQLite or CouchDB in a desktop application like Tomboy, and they can be huge, like Google with Bigtable, Facebook with Memcached, Amazon with Dynamo, and so on. The discussion about which database engine is the absolute best is the wrong one: you should choose the database system that fits the needs of the application, or use more than one, such as MySQL together with Memcached or PostgreSQL with CouchDB. Another alternative is a hybrid, like MySQL NDB, which puts a distributed object database behind MySQL as a back end, or HadoopDB, which puts PostgreSQL behind the Hadoop MapReduce implementation.
Let’s go back to the eternal discussion about relational vs. non-relational. Relational databases provide better transaction support than non-relational databases, mostly because they are more mature. They also enforce data constraints and consistency, because that is the basis of the relational model. Other benefits include complex reporting capabilities and vertical scaling to high-end hardware. On the other hand, horizontal scaling is not well supported, and relational databases tend to have a high administrative overhead.
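What "constraints and transactions" buy you is easiest to see in a few lines. The sketch below uses Python's built-in sqlite3 module only because it needs no server; the customers/orders schema and the helper names (place_orders, order_count) are assumptions made up for this example. The point is that a batch either goes in completely or not at all, and that the database itself rejects rows that break the rules.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
)
conn.execute(
    "CREATE TABLE orders ("
    " id INTEGER PRIMARY KEY,"
    " customer_id INTEGER NOT NULL REFERENCES customers(id),"
    " total REAL NOT NULL CHECK (total >= 0))"
)
conn.execute("INSERT INTO customers (id, name) VALUES (1, 'alice')")
conn.commit()


def place_orders(rows):
    """Insert all rows atomically; True on commit, False on rollback."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.executemany(
                "INSERT INTO orders (customer_id, total) VALUES (?, ?)", rows
            )
        return True
    except sqlite3.IntegrityError:
        return False


def order_count():
    """Number of orders currently stored."""
    return conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

For example, place_orders([(1, 20.0), (42, 7.0)]) returns False because customer 42 does not exist, and the valid first row of that batch is rolled back along with the bad one.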
SQL promotes portability and access from multiple applications, and it offers ways to manage database changes over time. There are many mature tools for working with SQL, but SQL is a full programming language that must be mastered to take advantage of it. NoSQL offers fast interfaces to the data, without impedance-matching layers between the application's objects and the stored data, which in turn allows for faster development. Typically, NoSQL databases have no separate database administrators (DBAs); programmers play that role.
A SQL-relational database makes the most sense when you have “immortal data”. If the data being stored is completely independent of the specific application and should remain available to new applications in the future, SQL-relational is probably the right choice. Let’s analyze some common problems where a DB is needed and the possible solutions.
- You have a blog and want to make it DB based: just use anything, including MySQL, PostgreSQL, SQLite, CouchDB, flat files, etc. Pick anything that is easy to install because it doesn’t matter.
- You want to use a database to unify several applications: to keep things consistent across, for example, a data warehousing application written in C++ and reporting tools written in Ruby on Rails, you should use an online transaction processing (OLTP) SQL-relational database like PostgreSQL.
- You need an application that is location aware: a geographical database, such as PostGIS, is the best. Geographical databases allow queries like “what’s near” and “what’s inside”.
- Your embedded hardware needs to write thousands of event objects per second: db4o (db4objects) is probably the right choice for your storage, but SQLite might also be considered, since only one instance is writing data.
- Your web application needs to access 100k objects per second coming from thousands of separate connections: Memcached, a distributed in-memory key-value store used by the biggest social networks, fits this case. It can be used as a supplement to a back-end relational database. Possible alternatives include Redis and Tokyo Tyrant.
- You are developing a data mining application: you have hundreds of government and organizational documents that you want to serve on the web or the company intranet and mine for data. The data was hard to get and is completely unstructured, which means that the structure must be derived from each individual document. CouchDB should be a good starting point.
- You develop a social application: you want to use the six-degrees-of-separation concept and implement a who-knows-who network, a problem that is very hard to solve with relational databases. Therefore you should use a graph database such as Neo4j. Long chains of relationships are difficult for relational databases, but graph databases, used in conjunction with another database, can handle these kinds of queries, as well as queries that find items “you may also” like, know, or be related to somehow.
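The who-knows-who case from the last item shows why long relationship chains suit a graph model: the question "how many hops separate two people?" is a breadth-first traversal, whereas in SQL each extra hop means another self-join. The sketch below is plain Python, not the Neo4j API; the names in the knows mapping and the function degrees_of_separation are hypothetical.

```python
from collections import deque

# Hypothetical who-knows-who data as an adjacency mapping.
knows = {
    "ana": {"ben", "cleo"},
    "ben": {"ana", "dev"},
    "cleo": {"ana", "dev"},
    "dev": {"ben", "cleo", "eli"},
    "eli": {"dev"},
    "zoe": set(),
}


def degrees_of_separation(graph, start, goal, max_depth=6):
    """Breadth-first search: hop count, or None if not within max_depth."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        person, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the allowed number of hops
        for friend in graph.get(person, ()):
            if friend == goal:
                return depth + 1
            if friend not in seen:
                seen.add(friend)
                queue.append((friend, depth + 1))
    return None
```

Here degrees_of_separation(knows, "ana", "eli") walks ana, then ben or cleo, then dev, then eli, so it returns 3, while zoe, who knows nobody, is unreachable and yields None.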
The real problems arise when you have to consider mixed cases. What if your queries don’t easily translate into SQL’s SELECT and UPDATE model? The data itself may fit perfectly into tables with foreign keys. The problem is that you are trying to let marketing people slice and dice it in fairly arbitrary ways, since their needs change from week to week.
This is typical data warehouse-type stuff. Compute the monetary total of all orders for customers from this region, who use 15 different currencies. If that total is within a threshold value provided by marketing, then what are the average age and the number of ringtones ordered by those customers, and what is the standard deviation for the remaining customers who have dogs but did not use the “recommend to a friend” function? And so on.
Then, instead of using PostgreSQL, you will wish you had done the whole thing in MongoDB using map-reduce. It would have been a lot faster, both to develop and to run. You would not have had to spend as much time figuring out which indices, counter caches, and denormalizations were required to have this week’s reports ready on time, that is, before the company's C-level meeting. So, even if your data model is nicely tabular, that doesn’t mean your usage patterns will be.
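For readers who have not met map-reduce, here is a rough sketch of the idea applied to the "total per region across several currencies" question. It is plain Python, not MongoDB's API; in MongoDB the map and reduce steps would be functions run by the server. The order documents, the exchange rates (deliberately made-up round numbers), and all function names are assumptions for illustration only.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical order documents, invented for illustration.
orders = [
    {"region": "EU", "currency": "EUR", "amount": 10.0},
    {"region": "EU", "currency": "GBP", "amount": 20.0},
    {"region": "US", "currency": "USD", "amount": 30.0},
    {"region": "EU", "currency": "EUR", "amount": 5.0},
]
# Made-up round conversion rates, NOT real exchange rates.
rates_to_usd = {"USD": 1.0, "EUR": 2.0, "GBP": 4.0}


def map_order(doc):
    """Map step: emit (region, amount converted to a common currency)."""
    yield doc["region"], doc["amount"] * rates_to_usd[doc["currency"]]


def reduce_totals(a, b):
    """Reduce step: combine two partial totals for the same key."""
    return a + b


def map_reduce(docs, mapper, reducer):
    """Group mapped values by key, then fold each group with the reducer."""
    groups = defaultdict(list)
    for doc in docs:
        for key, value in mapper(doc):
            groups[key].append(value)
    return {key: reduce(reducer, values) for key, values in groups.items()}


totals = map_reduce(orders, map_order, reduce_totals)
```

When marketing changes the question next week, only map_order and reduce_totals need to change; there is no schema, index, or counter cache to rework first, which is the development-speed argument made above.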
Summarizing, it’s currently fashionable to replace MySQL with some NoSQL database. This trend is driven by two factors:
- MySQL’s community is fragmenting into several forks as Oracle acquires the rights, which has created the impression that MySQL’s development is entering a riskier, unstable period.
- NoSQL is the technology buzzword du jour everywhere. It’s difficult to overstate the impact of social forces on technology choice: most technology selections are governed more by what our friends say than by an impartial and disinterested weighing of merits.