no:sql(eu) and NoSQL: What’s the deal?

22/04/2010, by Olivier Mallassi
Tags: Software Engineering

Looking at the scheduled speakers, no:sql(eu) in London promised to be a fantastic event, but that was without counting on a volcano deep down in Iceland... NoSQL being about (among other things) availability even in the case of disaster, no:sql(eu) degraded gracefully and eventually took place in a pretty efficient way, with speakers giving their talks remotely, mainly from the USA (sometimes very early in the morning because of the time difference). It was also an opportunity to meet and talk with Werner Vogels.

Anyway, here is what I will keep in mind after these 2 days in London:

- NoSQL is about data modeling. The Guardian's presentation introduced this concept perfectly: Oracle is still used for the core part because it fits their current needs, Redis is used for statistics computation, and Google Spreadsheets is used for data that needs to be easily mashed up with Google Maps. One dedicated storage solution for each type of data. Column-oriented, document, key/value and graph databases provide many ways of modeling and storing data.

- NoSQL is about data usage and reminds us that the most important part - and, I guess, the forgotten one in our traditional systems - is to understand how data is used. A system can live with inconsistent data (in fact this is already the case if you use database master/slave replication, caching or asynchronous approaches). This is not a big deal in most cases, but we need to keep it in mind (I must admit that we, as architects, often forget about this concern since we have been using RDBMS and ACID transactions for a long time now). It raises questions such as the following.

What happens to the business if an operation (let's say a withdrawal) is consistent but my balance stays inconsistent for a couple of seconds or even a couple of minutes?

Will my data be accessed in a highly concurrent manner, and is there therefore a high probability of conflicts on that data? In case of conflicts, do I need complex conflict resolution strategies for read repair? More specifically, do I need a vector clock model or a much simpler but still efficient mechanism based on timestamps, like in Cassandra?

How many revisions of my objects do I need to keep? For instance, Riak stores revisions of objects that can be used for conflict resolution, but the number of revisions you need to keep has to be correlated with the probability of having a conflict on that data.
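To make the vector clock versus timestamp question more concrete, here is a minimal Python sketch. It is not how Riak or Cassandra implement conflict handling: the `VectorClock` type, the `compare` and `last_write_wins` functions and the node names are assumptions made for illustration only. A vector clock can detect that two versions were written concurrently, while a timestamp comparison always picks a winner and silently drops the other write.

```python
from typing import Dict

# Hypothetical representation: node id -> number of writes seen from that node.
VectorClock = Dict[str, int]


def descends(a: VectorClock, b: VectorClock) -> bool:
    """Return True if clock `a` has seen everything recorded in clock `b`."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a: VectorClock, b: VectorClock) -> str:
    """Classify two versions as 'equal', 'a_newer', 'b_newer' or 'conflict'."""
    a_ge, b_ge = descends(a, b), descends(b, a)
    if a_ge and b_ge:
        return "equal"
    if a_ge:
        return "a_newer"
    if b_ge:
        return "b_newer"
    return "conflict"  # concurrent writes: the application must reconcile them


def last_write_wins(value_a, ts_a: float, value_b, ts_b: float):
    """Timestamp-based resolution: keep the value with the highest timestamp."""
    return value_a if ts_a >= ts_b else value_b


# Two replicas updated the same object independently (made-up data).
v1 = {"node_a": 2, "node_b": 1}
v2 = {"node_a": 1, "node_b": 2}

concurrent = compare(v1, v2)
ordered = compare({"node_a": 2}, {"node_a": 1})
winner = last_write_wins("balance=90", 1000.0, "balance=80", 1001.5)
print(concurrent, ordered, winner)
```

In this example the first pair of versions is concurrent, so the vector clock reports a conflict that somebody has to resolve, whereas the timestamp approach simply keeps the most recent value.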

How much time do I have between the moment my data is written and the moment that same data is read? And what is the probability of being in an inconsistent state? If there is one hour between the write and the read, the probability is quite low.

Is my data mainly used for reads or writes? Some of these systems have been optimized for writes and less for reads. For instance, Cassandra does many more disk accesses during a read than during a write.
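The read/write asymmetry can be illustrated with a toy log-structured store in Python. This is a deliberately simplified model, not Cassandra's actual code: the class name, the `memtable_limit` parameter and the `disk_accesses` counter are hypothetical, the cost of flushing is ignored, and real systems reduce the read cost with techniques such as Bloom filters and compaction, which are left out here. A write costs one sequential append, while a read may have to probe several on-disk tables.

```python
class ToyLogStructuredStore:
    """Toy model of a write-optimized store (illustration only)."""

    def __init__(self, memtable_limit: int = 2):
        self.memtable = {}       # in-memory buffer of recent writes
        self.sstables = []       # immutable "on-disk" tables, newest last
        self.memtable_limit = memtable_limit
        self.disk_accesses = 0   # counter used only for this illustration

    def write(self, key, value):
        self.disk_accesses += 1  # one sequential append to a commit log
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            # Flush the buffer to a new immutable table (cost ignored here).
            self.sstables.append(dict(self.memtable))
            self.memtable.clear()

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # newest table first
            self.disk_accesses += 1            # each probed table is a lookup
            if key in table:
                return table[key]
        return None


store = ToyLogStructuredStore(memtable_limit=2)
for i in range(6):
    store.write(f"k{i}", i)
write_cost = store.disk_accesses            # total for the six writes

value = store.read("k0")                    # oldest key: lives in the oldest table
read_cost = store.disk_accesses - write_cost
print(f"6 writes: {write_cost} accesses, 1 read: {read_cost} accesses")
```

Each write costs a single access in this model, while reading the oldest key has to walk through every table flushed since it was written.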

- NoSQL is about softness and elasticity: elasticity meaning "being able to add or remove machines as you go", softness meaning "having a loose data schema that will be able to evolve". With Neo4j, relationships between nodes can evolve, and properties of the nodes or relationships can change. To put it simply, your graph evolves with the way you use it... In key/value stores like Voldemort or column-oriented solutions like Cassandra, your model can evolve (add an attribute...) and storage will deal with it transparently. The next version (0.7) of Cassandra should implement a feature allowing dynamic, rolling deployment of new ColumnFamilies across all the nodes of the cluster.
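Here is a small Python sketch of what a loose schema means for application code. A plain dictionary stands in for the store, and the keys and attribute names are made up for the example. The point is only that a new attribute can appear on new records without migrating the old ones, as long as readers tolerate its absence.

```python
# A plain dict stands in for a key/value or column-oriented store (assumption).
store = {}

# Version 1 of the application writes users with two attributes.
store["user:1"] = {"name": "Alice", "email": "alice@example.com"}

# Version 2 adds a new attribute; the existing record is left untouched.
store["user:2"] = {"name": "Bob", "email": "bob@example.com", "twitter": "@bob"}


def twitter_handle(user_key: str) -> str:
    """Read an attribute that older records may not have."""
    return store[user_key].get("twitter", "unknown")


handles = [twitter_handle(key) for key in sorted(store)]
print(handles)
```

The flexibility moves the responsibility to the reader side: every consumer of the data has to decide what a missing attribute means.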

- NoSQL is about diversity. The NoSQL ecosystem is rich and composed of many different tools. Some of these tools are centered on simplicity and ease of use. MongoDB, for instance, provides a simple way of modeling data (as JSON-style documents) with rich query mechanisms. Other solutions are focused on performance and scalability. Riak, Cassandra and Voldemort (which unfortunately was not represented, and I regret it because this solution is also used on e-commerce sites) are the best examples of this kind of solution. Some are focused on alternative modeling, or even on models that are impossible to achieve in a relational model - graph databases and Neo4j are the best examples. Neo4j can traverse a graph of 1 million nodes in less than 2 ms on commodity hardware...

- NoSQL is about collaboration. This conference offered the opportunity to see different approaches and how these approaches can be mixed together. For instance, you can perfectly well have a system where part of the data is stored and queried through Neo4j (imagine the value of adding relationships between metadata...) with the objects themselves being stored in a Cassandra data store.

- NoSQL is about data warehousing. Kevin Weil from Twitter talked about Cassandra of course, but also talked a lot about how they use the 300 GB of logs the site generates every hour. Scribe is used to collect data. Hadoop is used to analyze all that data through Map/Reduce (the total number of tweets - 12 billion - is counted in around 5 minutes). Pig (a kind of high-level language that democratizes data querying) is used to harvest that data. This kind of log store is used not only for marketing purposes but also to understand technical problems...
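For readers who have never seen Map/Reduce, a counting job like the one mentioned above can be sketched in a few lines of plain Python. This is only an in-memory imitation of the programming model, not Hadoop or Pig code, and the log format and function names are assumptions. On a real cluster the map and reduce phases run in parallel over many machines.

```python
from collections import defaultdict

# Hypothetical log format "<user>\t<event>"; the real log layout is not known.
LOG_LINES = [
    "alice\ttweet",
    "bob\ttweet",
    "alice\tlogin",
    "carol\ttweet",
]


def map_phase(line):
    """Emit one (event, 1) pair per log line."""
    _user, event = line.split("\t")
    yield event, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework would do."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the partial counts for one key."""
    return key, sum(values)


def run_job(lines):
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = run_job(LOG_LINES)
print(counts)
```

Because each line is mapped independently and each key is reduced independently, both phases can be spread over as many machines as the volume of logs requires.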

- NoSQL is more about availability than necessarily about performance or scalability at the level Amazon or Twitter have reached. Jon Moore told us that NoSQL is about availability and concurrent access to data, especially in multi-datacenter contexts. His talk reminded us of the CAP theorem, of what partition tolerance really means in the case of a complete datacenter failure, and of how to choose between CP and AP in that case.

- NoSQL can be about querying. Of course, key/value or column-oriented approaches have limited query capabilities, but graph databases like Neo4j provide a rich way of looking for data (on node or relationship properties). Document databases like MongoDB or CouchDB provide a way of finding and filtering data based on its attributes.

Of course, there are some common drawbacks and limitations for these NoSQL solutions.

- Key management is a critical aspect, since the key is the entry point (at least for key/value and column-oriented systems) to your data. You can either use large randomly generated keys or add more semantics to the key and look for a natural uniqueness. Functional key approaches are back...

- The human and expertise aspects, again... These systems bring new paradigms. I am sure we will discuss that another time, but we, as developers, have to learn how to develop with these new paradigms. And we, as operations, have to learn how to operate and supervise these systems. Relational databases are well known and mature mainly due to their age: over 40 years of improvements bring you the guarantee that these systems are stable and reliable!

- Supervision is certainly the weakest part of these solutions, which provide only a few things "out of the box". We must nonetheless note that Riak (which is sponsored by Basho) has started developing statistics tools using SNMP standards (the future will tell us if these tools will be available in the open source version). Cassandra currently provides some statistics through JMX, which allows us to use JConsole. Moreover, JMX is supported by a lot of other tools, and you can already aggregate information coming from several Cassandra nodes in tools like Graphite.
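The two key strategies from the key management point above can be sketched as follows in Python. The key layout (`order:<customer>:<date>:<number>`) and the function names are hypothetical. The sketch only shows the trade-off between an opaque random key, which is unique but must be stored somewhere to find the data again, and a functional key, which can be rebuilt from business data.

```python
import uuid


def random_key() -> str:
    """Large random key: unique in practice, but carries no meaning."""
    return uuid.uuid4().hex  # 32 hexadecimal characters


def functional_key(customer_id: str, order_date: str, order_number: int) -> str:
    """Functional key built from business attributes (hypothetical layout)."""
    return f"order:{customer_id}:{order_date}:{order_number:06d}"


opaque = random_key()
natural = functional_key("c42", "2010-04-22", 7)
print(opaque)
print(natural)
```

With the functional key, any component that knows the customer, the date and the order number can compute the key and fetch the value directly. The price is that uniqueness now depends on the business data really being unique.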

To put it in a nutshell, NoSQL is about alternatives: using the most appropriate tool for each specific task. I can hear you thinking "Nothing new!" and you are right, except that we have never done this in the database world, which has been dominated by 40 years of relational database supremacy... Moreover, NoSQL is also about cloud computing (SimpleDB, S3...), but that's another story.