Senthilkumar Gopal

Musings of a machine learning engineer

Eight steps in choosing a database

We are planning a database architecture rewrite for an existing service and the needs of the project are:

  1. Consistent data with ACID compliance for LIVE data
  2. Async writes and random reads for metadata
  3. Fast and Async writes and very low reads for audit information

Usually, experts suggest a multi-database solution, or polyglot persistence approach instead of a traditional monolithich RDBMS solution.

Disadvantages of RDBMS

Scaling a traditional RDBMS is difficult at best. Partitioning schemes, multi-master configurations, and redundancy systems offered by Oracle, SQL Server, and DB2 are expensive and problematic at best. They often fail to meet the needs of high-scale applications. Also, for short lived data and with different lifetime data, RDBMS does not fit the needs of such an application.

Available Databases

Some of the database types available as per this Dzone article are:

Key-Value stores

The most commonly used database solutions are Key-Value stores such as Aerospike, Redis, and Riak. These are available for easier install and application implementation.

Column-Family databases

The other common databases available are Cassandra and HBase are both based on Hadoop but have different write semantics. While HBase offers strong write integrity and Cassandra offers eventual consistency. Also, Cassandra is ideal for high intense writes and random reads.

Document Databases

Document databases scale quite well and are great for web-based operational systems that operate on a single big entity, or systems that don’t require transactional integrity across entities. Typically, MongoDB and Couchbase are typically the leaders in this sector.

Graph Databases

Social networks and recommendation systems are classic use cases for graph databases, but there are a few different types of graph databases. Some of them are custom made for operational purposes (Neo4j) while others are aimed more at analytics (Apache Giraph).

Using the above descriptions and also based on availability or support within our organization, the following database types were determined.

Database Storage for
Oracle Partition scheme. Store Live data.
Cassandra Heavy writes for Async metadata
Cassandra Heavy writes for Audit log