Cassandra Consistency and Failover – Part 2
In Part 1 of this article, we saw how to configure Apache Cassandra across multiple Datacenters (DCs) as well as the clients for Cassandra solution availability. This ensures that if sufficient nodes in one DC were unavailable (or the entire DC went down), clients would seamlessly failover to the other DC. When the failed DC came back up, Cassandra would stream back the missed transactions and everything would be back to normal.
In this Part 2 of the article, we will expand our outlook to consider the whole solution – not just the database to understand how failures affect the Cassandra solution availability.
Cassandra backed application
Cassandra is a highly scalable and fast database typically used for storing high volume data such as those that arise from IoT sensors, mobile games or apps or web applications with a large user base. With the DataStax Enterprise (DSE) version that bundles Spark and Solr, there are many other interesting use cases as well. But for the purposes of this article, we will keep our focus on a simple web application; the architecture for which might look like something that is shown below:
Users from across the country access a set of web servers through load balancers (not shown). The request is processed by an application server tier which accesses the backed Cassandra database to fetch and store data. Scalability is built into this architecture – as our app gains new users, we can seamlessly scale all tiers to handle the load.
Failover and Availability
As with most applications, let us assume our entire solution is deployed on cloud infrastructure like AWS. How do we ensure that our application is highly available even if an entire Availability Zone (AZ) goes down? What if an entire region (say US-West) is not accessible due to a network glitch?
In Part 1 of this article, I showed you how to setup and configure Cassandra to be able to survive a DC failure. Our architecture using multiple DCs for Cassandra could then look like the image below.
The Cassandra database is replicated across DCs so that if one goes down, the application can failover to the other DC automatically. This assumes that the application itself is highly available. But this assumption is likely flawed as if a DC goes down, most likely all tiers deployed on that DC will also go down.
But what good is it if Cassandra can failover to a different DC but the web and app tiers cannot?
Cassandra Solution Availability
We need to ensure Cassandra solution availability i.e. the entire application is highly available while ensuring that users from a particular region access only the resources within that region for maximum performance. The correct solution then, is spread the web and app server layers across the DCs. This is shown in the diagram below. Note that this still assumes that load on US-West is lower so the Cassandra US-West DC has lower number of nodes.
Each DC’s stack only accesses the Cassandra nodes within its own DC using LOCAL_QUORUM. If a DC failure occurs, DNS switches users to the web servers in the other DC. In this case, these users from the remote DC will see longer latencies but the application will be available along with all their data. Note that Cassandra will still replicate data across the DCs ensuring consistency but each app server will only access the local DC.
For true high availability, it is not sufficient to look at one piece of the solution in isolation. The entire Cassandra solution availability must be taken care of by the architecture.