Navigating the NoSQL Tower of Babel: Choosing the right database for your application

(picture taken from http://joitskehulsebosch.blogspot.in/2010/06/online-tools-for-internal-communication.html)

This is the second article in the series on unraveling NoSQL. The first article wondered why everyone seems to be rolling their own NoSQL DBs these days. In this article, we will look at how to choose the right data store for your needs. (I still struggle to call a less-than-100%-persistent data store a database, thanks to CS564 at UW-Madison and an internship at IBM Almaden on DB2 :) but I'll use both terms interchangeably here.)
The countless variations can easily baffle us, so we'll tackle this by looking at how data stores evolved from the beginning of Creation (Hello Darwin!).

First, people were storing data in file systems and writing complex COBOL programs. Then Codd said "Let there be Relational Data Model!". He even made 12 commandments! This model was exceptionally good at managing structured data with well-defined columns and rows, like Excel tables. And people spoke a single language called SQL to query the data. Coupled with ACID guarantees (atomicity, consistency, isolation, durability), relational database management systems (RDBMSs) ensured that we would never lose or muddle our bank account data even in the middle of multiple concurrent transactions.
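
To make the bank account point concrete, here is a minimal sketch of atomicity using Python's built-in sqlite3 module. The accounts table, the names, and the transfer helper are all made up for illustration; the point is only that a transfer that fails halfway leaves no trace.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one transaction (illustrative helper)."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            # Credit first, so a failing debit must undo work already done.
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected an overdraft; the credit was rolled back.
        return False

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "id TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

ok = transfer(conn, "alice", "bob", 30)       # succeeds
failed = transfer(conn, "alice", "bob", 500)  # would overdraw alice, so nothing changes
balances = dict(conn.execute("SELECT id, balance FROM accounts"))
```

After the failed transfer, bob does not keep the 500 he was briefly credited: both updates happen or neither does.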

The data was structured, the DBMS was reliable, the DBAs were paid well, and everyone was happy.
Then the Web happened. And data became far trickier and larger.

Key-Value Stores: Web companies like Google, Amazon, and pets.com found that forcing data into a predefined column structure was overly constraining when all you needed was a key (e.g., Customer ID) to get the value (e.g., name, purchase history, image). But one small thing: there would be many millions of these records and performance was essential. This led to distributed Key-Value Stores.
Many of them used variations of Distributed Hash Tables (DHTs), or range partitioning of sorted keys, to spread data across highly scalable servers. Google's Bigtable is the granddaddy of them all. Apache HBase, from the Hadoop ecosystem, is roughly its open-source version. Amazon's Dynamo optimized for writes and gave high availability at the cost of some consistency. Cassandra, which started at Facebook, was a mix of both: Dynamo's distribution design with a Bigtable-style data model. [Check out the CAP Theorem to figure out why a distributed system can't guarantee all three of Consistency, Availability, and Partition Tolerance at once.] You go for one of these systems when you are really dealing with BIG Key-Value DATA scaling to many, many servers. As a bonus, systems such as HBase plug into MapReduce, so you get great processing power on these same servers.
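
To give a feel for the hashing idea, here is a minimal sketch of consistent hashing, the technique Dynamo-style stores use to decide which server owns a key. This is not any product's actual code: the HashRing class, the server names, and the choice of 100 virtual points per server are assumptions for illustration.

```python
import bisect
import hashlib

class HashRing:
    """Toy consistent-hash ring (illustrative, not a real library API)."""

    def __init__(self, nodes, replicas=100):
        # Each node gets `replicas` virtual points on the ring to even out load.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(replicas)
        )
        self._hashes = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)

    def node_for(self, key):
        # Walk clockwise to the first virtual point after the key's hash,
        # wrapping around to the start of the ring if needed.
        idx = bisect.bisect(self._hashes, self._hash(key)) % len(self._hashes)
        return self._ring[idx][1]

keys = [f"customer:{i}" for i in range(1000)]
ring = HashRing(["server-a", "server-b", "server-c"])
before = {k: ring.node_for(k) for k in keys}

# Add a fourth server and see which keys change owner.
bigger = HashRing(["server-a", "server-b", "server-c", "server-d"])
after = {k: bigger.node_for(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

The nice property is that adding a server only moves the keys that the new server takes over; everything else stays where it was. With a plain hash-modulo-N scheme, most keys would change servers.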

What if you had more specialized needs and needed something simpler to manage?

Document Stores: In many web applications, the schema varied from row to row, and a single column could hold a whole set of values (a strict no-no in a normalized relational schema). XML was a natural fit for such data, now called documents. MongoDB and Apache CouchDB are very good at storing document data. They go one step further and store data as JSON-style documents (MongoDB uses a binary variant called BSON), which makes them really easy to process in JavaScript. CouchDB takes the cake in tracking document revisions and syncing replicas, which makes it good for Wiki-like apps and mobile apps that may have to work in a disconnected mode. MongoDB can handle huMONGOus data. Check out one of the many articles on MongoDB vs. CouchDB.
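
Here is a small sketch of what "documents" look like in practice, using plain Python dicts and the standard json module rather than a real MongoDB or CouchDB client. The posts, field names, and the with_tag helper are invented for illustration; the "_id" key follows the convention both stores use for document identifiers.

```python
import json

# Two documents in the same collection with different shapes:
# one has a set-valued "tags" field, the other has nested comments instead.
posts = [
    {"_id": "post-1", "title": "Hello NoSQL", "tags": ["nosql", "databases"]},
    {
        "_id": "post-2",
        "title": "Untitled draft",
        "author": {"name": "alice"},
        "comments": [{"by": "reader", "text": "Nice!"}],
    },
]

def with_tag(docs, tag):
    """Return ids of documents carrying `tag`; documents without tags are skipped."""
    return [doc["_id"] for doc in docs if tag in doc.get("tags", [])]

tagged = with_tag(posts, "nosql")

# A document survives a round trip through JSON text unchanged.
roundtrip = json.loads(json.dumps(posts[1]))
```

Neither document had to declare its fields up front, and the list-valued and nested fields needed no extra tables.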

Main Memory Stores: Then Facebook and Twitter happened, and we started seeing millions of simultaneous users on a website. They were creating content, chatting with each other, and tweeting to the world. Going to the disk was not an option when you wanted an almost real-time experience. Enter memcached, a distributed main-memory cache. It is a great solution to put in front of one of the persistent databases to improve performance. Want the data persisted to disk as well (without giving up much of the performance)? You can choose from Redis, Membase, etc. Redis gives you pub/sub, a really neat way to push data to multiple browsers (e.g., stock updates, chat broadcasts), and supports data structures like lists. Couple Redis with Node.js and Socket.IO and you have got yourself a blazing fast chat server!
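
The "cache in front of a database" idea is often called cache-aside, and it can be sketched in a few lines. To keep this runnable anywhere, a plain dict stands in for memcached and an in-memory SQLite database stands in for the persistent store; the CacheAside class and the users table are assumptions, not a real client API.

```python
import sqlite3

class CacheAside:
    """Read-through cache with invalidate-on-write (illustrative sketch)."""

    def __init__(self, conn):
        self.conn = conn
        self.cache = {}    # stand-in for memcached
        self.db_reads = 0  # counts how often we had to hit the database

    def get_name(self, user_id):
        key = f"user:{user_id}"
        if key in self.cache:  # cache hit: no database access at all
            return self.cache[key]
        self.db_reads += 1     # cache miss: fall back to the database
        row = self.conn.execute(
            "SELECT name FROM users WHERE id = ?", (user_id,)
        ).fetchone()
        if row is None:
            return None
        self.cache[key] = row[0]
        return row[0]

    def rename(self, user_id, name):
        with self.conn:
            self.conn.execute(
                "UPDATE users SET name = ? WHERE id = ?", (name, user_id)
            )
        # Drop the stale entry so the next read reloads it from the database.
        self.cache.pop(f"user:{user_id}", None)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO users VALUES (1, 'Ada')")
conn.commit()

store = CacheAside(conn)
first = store.get_name(1)    # miss, reads the database
second = store.get_name(1)   # hit, served from memory
reads_after_two_gets = store.db_reads
store.rename(1, "Ada L.")
third = store.get_name(1)    # miss again, because the entry was invalidated
```

A real deployment also has to think about expiry times and about what happens when the cache and database briefly disagree, which this sketch ignores.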

What about Querying in these key-value stores?

This is where I start to miss SQL. You are typically restricted to lookups based on keys and field values, with scant support for Joins. So, when you design your data model, you pretty much have to predict the kinds of queries you are likely to run. I suspect that Joins will make their way back into these NoSQL systems as well (e.g., Hive, which puts a SQL-like language on top of Hadoop).
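
Here is a rough sketch of what "predicting your queries" means. A dict plays the role of a key-value store that only supports get and put by key; the OrderStore class and the key naming scheme are made up for illustration. Because there is no Join, the write path maintains a per-user list of order ids, and the read path stitches the pieces together in application code.

```python
class OrderStore:
    """Key-value data model designed around one known query (illustrative)."""

    def __init__(self):
        self.kv = {}  # stand-in for a key-value store: lookups by key only

    def add_user(self, user_id, name):
        self.kv[f"user:{user_id}"] = {"name": name}

    def place_order(self, order_id, user_id, item):
        self.kv[f"order:{order_id}"] = {"user_id": user_id, "item": item}
        # Maintain a hand-rolled index at write time, because we know we
        # will later ask "which orders belong to this user?".
        self.kv.setdefault(f"user:{user_id}:orders", []).append(order_id)

    def orders_with_names(self, user_id):
        # The "join" done by hand: one lookup for the user, one for the
        # index, then one lookup per order.
        name = self.kv[f"user:{user_id}"]["name"]
        order_ids = self.kv.get(f"user:{user_id}:orders", [])
        return [(name, self.kv[f"order:{oid}"]["item"]) for oid in order_ids]

store = OrderStore()
store.add_user(1, "Ada")
store.add_user(2, "Grace")
store.place_order(100, 1, "keyboard")
store.place_order(101, 2, "monitor")
store.place_order(102, 1, "mouse")

ada_orders = store.orders_with_names(1)
```

If you later need a query you did not plan for, say all orders containing a given item, you are back to scanning everything or adding yet another index.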

OK, which NoSQL DB should you use?

It depends on your application, really. But the above text gives hints on what these DBs were primarily designed for. It's possible that you may need more than one. For example, in a chat application, you may use Redis to handle rapid messaging and MySQL for analytics on the old messages.

So, is MySQL dead?

NOOOO! It's a tempting trend to say No to SQL and use one of these new babies. But that should not be a decision taken lightly. MySQL is still the most versatile, reliable, well-tested, and well-understood DBMS around. Worried about its scalability? You can use horizontal partitioning (sharding). Want low latency? Use memcached in front of MySQL. Google apparently serves its ads from MySQL.
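
Sharding sounds fancier than it is. Below is a minimal sketch that routes each user to one of several databases by user id. In-memory SQLite databases stand in for separate MySQL servers, and the ShardedUsers class and modulo routing rule are assumptions for illustration, not a recommended production scheme.

```python
import sqlite3

class ShardedUsers:
    """Route rows to one of several databases by user id (illustrative)."""

    def __init__(self, num_shards):
        # Each connection stands in for a separate database server.
        self.shards = [sqlite3.connect(":memory:") for _ in range(num_shards)]
        for conn in self.shards:
            conn.execute(
                "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
            )

    def _shard_for(self, user_id):
        # Simplest possible routing rule: user id modulo the shard count.
        return self.shards[user_id % len(self.shards)]

    def put(self, user_id, name):
        conn = self._shard_for(user_id)
        with conn:
            conn.execute(
                "INSERT OR REPLACE INTO users (id, name) VALUES (?, ?)",
                (user_id, name),
            )

    def get(self, user_id):
        row = self._shard_for(user_id).execute(
            "SELECT name FROM users WHERE id = ?", (user_id,)
        ).fetchone()
        return row[0] if row else None

    def rows_per_shard(self):
        return [
            conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
            for conn in self.shards
        ]

users = ShardedUsers(num_shards=3)
for uid in range(10):
    users.put(uid, f"user-{uid}")

counts = users.rows_per_shard()
found = users.get(7)
```

The catch is that single-key lookups stay cheap, but any query that spans users has to fan out to every shard, and changing the number of shards under a modulo rule means moving most of the data.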

So, is NoSQL dead?

NO again! For example, if your application needs to handle unstructured data or is highly interactive, and it does not need complex queries, you should definitely evaluate one of the above. I've been playing with Redis and am amazed at how well it fits my app's needs!

What next? There are many more variations on the NoSQL theme, e.g., to support graph data, Big Data Analytics (buzzword alert!), etc. You can find far more comprehensive comparisons in the following pages and on Quora.

NoSQL is a Horseless Carriage by Steven Yen. Check out the nice taxonomy.
Cassandra vs MongoDB vs CouchDB vs Redis vs Riak vs HBase vs Membase vs Neo4j comparison

thanks, vishy

Vishy Poosala

innovate & impact

The Innovation Edge - Vishy Poosala

Monday, February 13, 2012
