What's new in Apache Cassandra 0.7+

May 17, 2011, by Jordan Pittier

It's been a while since we last blogged about Apache Cassandra. Let's catch up with the new features available since version 0.7.

Online schema updates

Before version 0.7, Keyspaces and ColumnFamilies had to be declared in Cassandra's configuration file. Adding, removing or updating a Keyspace or a ColumnFamily required a rolling restart of the cluster. This is no longer necessary with Cassandra 0.7, thanks to new methods added to the API: schemas can now be changed without restarting the cluster.

Here is a small example with the CLI:

11:55 root@sd-28364 ~ # cassandra-cli -h localhost
Connected to: "Test Cluster" on localhost/9160
Welcome to cassandra CLI.
[default@unknown] create keyspace myTestKS with replication_factor = 1;
[default@unknown] use myTestKS;
[default@myTestKS] create column family myTestCF with comparator = UTF8Type and rows_cached = 100;

Secondary indexes

Secondary indexes are indexes on column values. It is now possible to fetch rows not only by their row key, but also by the value of their indexed columns. Prior to version 0.7, "inverted indexes" had to be maintained manually, which was cumbersome and "non-atomic". This feature makes advanced queries easier to write.

Here is an example based on the "myTestCF" column family created in the previous CLI session:

[default@myTestKS] update column family myTestCF with column_metadata=[{column_name: myIntColumn, validation_class: LongType, index_type: KEYS}];
[default@myTestKS] set myTestCF['key1']['myIntColumn'] = 2;
[default@myTestKS] get myTestCF where myIntColumn = 2;
-------------------
RowKey: key1
=> (column=myIntColumn, value=2, timestamp=1305288027744000)
1 Row Returned.

For more details about Secondary Indexes, see the Apache Cassandra Wiki and this good blog post by Jonathan Ellis.
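
To illustrate why the hand-maintained "inverted indexes" of earlier versions were cumbersome, here is a minimal Python sketch. It is a toy in-memory model, not Cassandra code: the names rows, index, set_column and get_where are made up for the illustration. Each logical write turns into several separate writes, and a failure between any two of them would leave the index out of sync with the data, which is the "non-atomic" problem mentioned above.

```python
# Toy, in-memory model of a hand-maintained inverted index (not Cassandra code).
# 'rows' plays the role of the data column family, 'index' the role of a
# second column family mapping a column value to the row keys holding it.
rows = {}    # row key -> {column name: value}
index = {}   # (column name, value) -> set of row keys


def set_column(row_key, column, value):
    """Write a column and keep the inverted index in sync by hand."""
    old = rows.get(row_key, {}).get(column)
    # Write 1: the data itself.
    rows.setdefault(row_key, {})[column] = value
    # Write 2: drop the stale index entry, if any.
    if old is not None and old != value:
        index[(column, old)].discard(row_key)
    # Write 3: add the new index entry.
    index.setdefault((column, value), set()).add(row_key)


def get_where(column, value):
    """Return the rows whose 'column' equals 'value', using the index."""
    keys = sorted(index.get((column, value), set()))
    return {key: rows[key] for key in keys}


set_column("key1", "myIntColumn", 2)
set_column("key2", "myIntColumn", 2)
set_column("key2", "myIntColumn", 7)   # key2 moves from value 2 to value 7
result = get_where("myIntColumn", 2)
print(result)   # {'key1': {'myIntColumn': 2}}
```

With a built-in secondary index, this bookkeeping no longer has to live in application code.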

Other Changes

There are many other changes, and the full changelog is worth reading. Among them:

Support for expiring columns. Columns can be inserted along with a Time To Live (TTL). The column is automatically marked for deletion once the TTL has expired.
Numerous performance optimizations, including improvements to read repair and a more efficient use of the row cache.
Some API changes. It is now considered best practice to query Cassandra through third-party libraries such as Hector (in Java) or Pycassa (in Python) rather than through "low level" Thrift objects.
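
The behaviour of expiring columns can be sketched with a small Python model. This is only an illustration of the idea, not Cassandra code and not the API of any client library: the insert and read helpers and the explicit now argument are assumptions made to keep the example deterministic.

```python
# Toy model of expiring columns (illustration only, not Cassandra code).
# Each column stores its value with an optional expiry time in seconds.
columns = {}   # column name -> (value, expires_at or None)


def insert(name, value, now, ttl=None):
    """Store a column; with a TTL it expires 'ttl' seconds after 'now'."""
    expires_at = None if ttl is None else now + ttl
    columns[name] = (value, expires_at)


def read(name, now):
    """Return the value, or None if the column is missing or has expired."""
    if name not in columns:
        return None
    value, expires_at = columns[name]
    if expires_at is not None and now >= expires_at:
        return None   # expired: treated as deleted
    return value


insert("session_token", "abc123", now=1000, ttl=60)   # lives for 60 seconds
insert("user_name", "jordan", now=1000)               # no TTL: never expires

before = read("session_token", now=1030)   # still within the TTL
after = read("session_token", now=1061)    # TTL has elapsed
permanent = read("user_name", now=10**9)
print(before, after, permanent)   # abc123 None jordan
```

The sketch only models what a reader sees. It says nothing about when the expired data is physically removed.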

All in all, on a personal note, Cassandra seems production-ready for "the rest of us". But wait, there is more to come with the promising 0.8 branch, scheduled for release in a couple of days:

CQL

Cassandra Query Language is a basic SQL-like language for data management in Cassandra. It will be possible to write queries like these:

SELECT column1, column2 FROM myTestCF USING CONSISTENCY.QUORUM WHERE KEY > "aaa" AND "column1" = 5;
UPDATE myTestCF SET column1 = value1, column2 = value2 WHERE KEY = "aaa";
DELETE column1 FROM myTestCF USING CONSISTENCY.ALL WHERE KEY IN ("aaa", "aab");

Distributed Counters

These are columns of type "Long" that can be incremented or decremented. Yeah, that's right, they are just counting, but they are doing it fast. Counters have long been awaited in Cassandra and they seem to be an ideal fit for "realtime analytics". See this presentation by Twitter on their use of counters.
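
As a rough idea of how counters fit "realtime analytics", here is a small Python sketch that counts page views per hour. It is a toy in-memory model, not Cassandra code: the add helper, the row keys and the column names are all hypothetical. The point it tries to show is that the client only sends a delta ("add 1", "subtract 1") instead of reading the current value and writing it back.

```python
from collections import defaultdict

# Toy model of counter columns (illustration only, not Cassandra code).
# counters[row_key][column_name] holds an integer that defaults to 0.
counters = defaultdict(lambda: defaultdict(int))


def add(row_key, column, delta=1):
    """Increment (or, with a negative delta, decrement) a counter column."""
    counters[row_key][column] += delta
    return counters[row_key][column]


# Count page views per hour as events arrive: one row per page,
# one counter column per hour.
events = [
    ("/home", "2011-05-17T10"),
    ("/home", "2011-05-17T10"),
    ("/blog", "2011-05-17T10"),
    ("/home", "2011-05-17T11"),
]
for page, hour in events:
    add(page, hour)

add("/home", "2011-05-17T10", -1)   # a decrement, e.g. to discard a bot hit
home_views = dict(counters["/home"])
print(home_views)   # {'2011-05-17T10': 1, '2011-05-17T11': 1}
```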

Compaction performance enhancements

Starting with version 0.8, SSTable compactions are multithreaded, so compactions should complete faster. Several compactions can also run in parallel, which should limit the proliferation of new SSTables while a long compaction is in progress. This should be really helpful for write-heavy workloads.
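
To make the idea concrete, here is a minimal Python sketch of what a compaction does and of two compactions running side by side. It is a toy model under simplifying assumptions, not Cassandra's implementation: an "sstable" is just a list of (key, timestamp, value) tuples sorted by key, the compact function is made up for the example, and Python threads only stand in for the scheduling idea.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

# Toy model of compaction (illustration only, not Cassandra code).
# An "sstable" here is a list of (key, timestamp, value) sorted by key.


def compact(*sstables):
    """Merge sorted sstables into one, keeping the newest value per key."""
    merged = []
    for key, timestamp, value in heapq.merge(*sstables):
        if merged and merged[-1][0] == key:
            # Same key as the previous entry: heapq.merge yields entries in
            # ascending (key, timestamp) order, so this one is newer.
            merged[-1] = (key, timestamp, value)
        else:
            merged.append((key, timestamp, value))
    return merged


# Two independent groups of sstables, compacted in parallel.
groups = [
    ([("key1", 1, "a"), ("key3", 1, "c")], [("key1", 2, "A"), ("key2", 2, "b")]),
    ([("key8", 5, "x")], [("key8", 6, "y"), ("key9", 6, "z")]),
]
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(lambda group: compact(*group), groups))

print(results[0])   # [('key1', 2, 'A'), ('key2', 2, 'b'), ('key3', 1, 'c')]
print(results[1])   # [('key8', 6, 'y'), ('key9', 6, 'z')]
```

Each group shrinks from two sstables to one, and the second compaction does not have to wait for the first to finish.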

DataStax's products

Recently, DataStax (the commercial leader in Apache Cassandra) announced two new products. The first is OpsCenter (free for non-production use only), an administration console for managing, monitoring and operating Cassandra and Hadoop clusters. The second is Brisk, an open-source Apache Hadoop and Hive distribution that uses Apache Cassandra for many of its core services.

Looking forward to it!