Chapter 5: NoSQL Databases: MongoDB and Cassandra

In the previous chapter, we took a deep dive into the world of relational databases, the bedrock of data management for decades. However, the rise of big data, with its massive volume, high velocity, and complex variety, has exposed the limitations of the relational model. The rigid schemas, the difficulty of horizontal scaling, and the overhead of ACID transactions that make relational databases so reliable for transactional workloads can become a bottleneck when dealing with web-scale applications and massive datasets. This led to the rise of a new class of databases, collectively known as NoSQL.

NoSQL, which is often interpreted as “Not Only SQL,” is not a single product but rather a broad movement that encompasses a wide variety of different database technologies. What they all have in common is that they were designed to solve the challenges of scalability, performance, and flexibility that were difficult to address with traditional relational databases. In this chapter, we will explore the world of NoSQL. We will understand the fundamental principles that guide NoSQL database design, such as the CAP theorem and the BASE consistency model. We will then take a deep dive into two of the most popular open-source NoSQL databases: MongoDB, the leading document database, and Apache Cassandra, a massively scalable wide-column store. We will also briefly look at Redis, the ubiquitous in-memory data store. By the end of this chapter, you will understand when and why to choose a NoSQL database and have the practical knowledge to start using them effectively.

5.1 The NoSQL Movement: A New Way of Thinking About Data

To understand NoSQL, we must first understand the problems it was designed to solve. In the 2000s, companies like Google, Amazon, and Facebook were dealing with data at a scale that the world had never seen before. They found that scaling their relational databases to handle millions of users and petabytes of data was becoming increasingly difficult and expensive.

The Challenges of Scaling Relational Databases

This led the web-scale pioneers to develop their own, non-relational database systems that were designed from the ground up for horizontal scalability. Google created Bigtable, Amazon created Dynamo, and the papers they published about these systems in the mid-2000s laid the foundation for the NoSQL movement.

The CAP Theorem: A Fundamental Trade-off

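In 2000, computer scientist Eric Brewer proposed the CAP theorem, which states that it is impossible for a distributed data store to simultaneously provide more than two out of the following three guarantees:

Consistency: every read receives the most recent write or an error.

Availability: every request receives a non-error response, though not necessarily one that reflects the most recent write.

Partition tolerance: the system continues to operate even when the network drops or delays messages between nodes.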

Because network partitions cannot be ruled out in a real distributed system, tolerating them is not optional in practice. A distributed database must therefore choose between Consistency and Availability when a network partition occurs. This is the fundamental trade-off of distributed systems.

Traditional relational databases are often described as CA (Consistent and Available) systems: they are designed to run on a single, reliable server, so there is no network between replicas to partition, and they make no attempt to be partition-tolerant. Most NoSQL databases, on the other hand, are designed to be distributed and partition-tolerant, which means they must make a choice between C and A.
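The choice between C and A during a partition can be illustrated with a toy model. The following is a minimal sketch, not how any real database is implemented: the `TinyStore` class, its two replicas, and the `"CP"`/`"AP"` mode flag are all hypothetical names invented for this illustration.

```javascript
// Toy model: two replicas of a single value and a network link that can be cut.
class Replica {
  constructor(name) {
    this.name = name;
    this.value = "v0";
  }
}

class TinyStore {
  // mode is "CP" (refuse writes that cannot be replicated) or "AP" (accept them anyway)
  constructor(mode) {
    this.mode = mode;
    this.a = new Replica("a");
    this.b = new Replica("b");
    this.partitioned = false;
  }

  // A client connected to replica `a` attempts a write.
  writeViaA(value) {
    if (!this.partitioned) {
      this.a.value = value;
      this.b.value = value; // replication succeeds
      return "ok";
    }
    if (this.mode === "CP") {
      return "rejected"; // stay consistent, give up availability
    }
    this.a.value = value; // stay available, replicas now disagree
    return "ok";
  }
}

const cp = new TinyStore("CP");
cp.partitioned = true;
const cpResult = cp.writeViaA("v1");

const ap = new TinyStore("AP");
ap.partitioned = true;
const apResult = ap.writeViaA("v1");

console.log(cpResult, cp.a.value, cp.b.value); // rejected v0 v0
console.log(apResult, ap.a.value, ap.b.value); // ok v1 v0
```

In the CP store both replicas still agree, but the client was turned away. In the AP store the client succeeded, but a read from replica `b` would return stale data until the partition heals.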

BASE vs. ACID: A Different Set of Guarantees

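This trade-off between consistency and availability leads to a different consistency model than the strict ACID guarantees of relational databases. Many NoSQL databases are designed around the principles of BASE:

Basically Available: the system responds to requests even when some nodes have failed, though a response may be stale or partial.

Soft state: the state of the system may change over time even without new input, as replicas exchange updates.

Eventual consistency: if no new updates arrive, all replicas eventually converge to the same value.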

BASE is a more relaxed consistency model than ACID. It prioritizes availability over immediate consistency, which is often an acceptable trade-off for web-scale applications. For example, if a user updates their profile picture on a social media site, it is okay if it takes a few seconds for that change to be visible to all their friends. It is more important that the site remains available.
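The profile picture scenario can be sketched in a few lines. This is an illustration only, assuming a simple "highest version wins" rule; the `makeReplica` and `sync` helpers and the file names are hypothetical, and real systems use more elaborate replication and conflict resolution.

```javascript
// Each replica stores the value together with a version number.
function makeReplica() {
  return { value: "old.png", version: 1 };
}

// Synchronization step: both replicas adopt whichever copy has the higher version.
function sync(x, y) {
  const newest = x.version >= y.version ? x : y;
  x.value = y.value = newest.value;
  x.version = y.version = newest.version;
}

const replicas = [makeReplica(), makeReplica(), makeReplica()];

// The write is acknowledged after reaching only one replica.
replicas[0].value = "new.png";
replicas[0].version = 2;

// A friend whose request lands on another replica still sees the old picture.
const staleRead = replicas[2].value;

// Background synchronization between pairs of replicas.
sync(replicas[0], replicas[1]);
sync(replicas[1], replicas[2]);

// Once the replicas have converged, every read returns the new picture.
const laterRead = replicas[2].value;
console.log(staleRead, laterRead); // old.png new.png
```

The write never blocked on the slower replicas, which is the availability benefit. The cost is the window in which `staleRead` differs from the latest value.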

Types of NoSQL Databases

NoSQL is not a single technology but a family of different database types, each with its own data model and use cases.

5.2 MongoDB: The Leading Document Database

MongoDB is the most popular document database and one of the most popular NoSQL databases overall. It was designed to be a flexible, scalable, and easy-to-use database for modern applications.

The Document Data Model

MongoDB stores data in documents, which are JSON-like structures with a flexible schema. Documents are grouped into collections, which are analogous to tables in a relational database.

// A document in a "users" collection
{
  "_id": ObjectId("636a8f29e4b0e3b2f1a2b3c4"),
  "username": "alice",
  "email": "alice@example.com",
  "address": {
    "street": "123 Main St",
    "city": "Anytown"
  },
  "interests": ["data engineering", "climbing"],
  "last_login": ISODate("2025-11-08T10:00:00Z")
}

This flexible data model is one of MongoDB’s key strengths. You can have documents with different fields in the same collection, and you can easily represent nested or hierarchical data. This makes it a natural fit for object-oriented programming and for handling the semi-structured data that is common in modern applications.
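To make the flexible schema concrete, the sketch below uses plain JavaScript objects as stand-ins for documents. It does not use MongoDB itself; the `getPath` helper and the second user's fields are hypothetical and exist only to show how code can cope with documents of different shapes.

```javascript
// Two documents in the same hypothetical collection, with different fields.
const users = [
  {
    username: "alice",
    address: { street: "123 Main St", city: "Anytown" },
    interests: ["data engineering", "climbing"],
  },
  { username: "bob", phone: "555-0100" }, // no address, no interests
];

// Resolve a dotted path such as "address.city"; returns undefined if any step is missing.
function getPath(doc, path) {
  return path.split(".").reduce(
    (current, key) =>
      current != null && typeof current === "object" ? current[key] : undefined,
    doc
  );
}

const cities = users.map((u) => getPath(u, "address.city"));
console.log(cities); // [ 'Anytown', undefined ]
```

The flexibility moves responsibility into the application: code that reads a collection has to tolerate missing fields, because the database does not enforce their presence by default.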

Architecture: Replica Sets and Sharding

MongoDB was designed for scalability and high availability.

Querying and Indexing

MongoDB has a rich query language that allows you to perform complex queries, including filtering, sorting, and aggregation. Queries are themselves expressed as JSON-like documents.

// Find all users who are interested in data engineering
db.users.find({ interests: "data engineering" })

// Find all users in Anytown and sort by username
db.users.find({ "address.city": "Anytown" }).sort({ username: 1 })

To ensure good query performance, MongoDB supports a variety of index types, including single-field, compound, multi-key (for arrays), and geospatial indexes.
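The idea behind a multi-key index can be sketched without a database. The example below is a simplification under stated assumptions: MongoDB's real indexes are ordered B-tree structures, whereas this sketch uses a plain `Map`, and `buildMultikeyIndex` is a hypothetical helper name. What it does preserve is the key property that an array field produces one index entry per element.

```javascript
const docs = [
  { _id: 1, username: "alice", interests: ["data engineering", "climbing"] },
  { _id: 2, username: "bob", interests: ["climbing"] },
  { _id: 3, username: "carol", interests: ["data engineering"] },
];

// Build an index with one entry per array element, mapping value -> list of _id.
function buildMultikeyIndex(collection, field) {
  const index = new Map();
  for (const doc of collection) {
    const values = Array.isArray(doc[field]) ? doc[field] : [doc[field]];
    for (const value of values) {
      if (!index.has(value)) index.set(value, []);
      index.get(value).push(doc._id);
    }
  }
  return index;
}

const byInterest = buildMultikeyIndex(docs, "interests");

// Answer the query by a direct lookup instead of scanning every document.
const matches = byInterest.get("data engineering") || [];
console.log(matches); // [ 1, 3 ]
```

Without the index, the query on `interests` would have to inspect every document in the collection. With it, the cost depends on the number of matches rather than on the size of the collection.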

Use Cases

MongoDB is a versatile database that is well-suited for a wide range of use cases:

5.3 Apache Cassandra: The Master of Scale

Apache Cassandra is an open-source, distributed wide-column store designed for massive scalability and high availability with no single point of failure. It was originally developed at Facebook to power its inbox search feature and was later open-sourced.

The Wide-Column Data Model

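Cassandra’s data model can be thought of as a multi-dimensional map. Data is stored in tables, which have rows and columns. However, unlike a relational database, a row does not need to store a value for every column, so different rows in the same table can hold different sets of columns. Each row is uniquely identified by a primary key. The first part of the primary key is the partition key, which determines which nodes store the row; any remaining parts are clustering columns, which determine the order of rows within a partition.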

// A conceptual view of a table in Cassandra
CREATE TABLE user_activity (
    user_id uuid,
    timestamp timeuuid,
    activity_type text,
    product_id uuid,
    PRIMARY KEY (user_id, timestamp)
);

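// Two conceptual rows in the same partition (values simplified for readability;
// real user_id and product_id values would be UUIDs, and timestamp a timeuuid)
// Row 1
// user_id: 123, timestamp: 2025-11-08 10:00, activity_type: "view_product", product_id: 456
// Row 2
// user_id: 123, timestamp: 2025-11-08 10:01, activity_type: "add_to_cart", product_id: 456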
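In the table definition above, the primary key is (user_id, timestamp): user_id groups rows into a partition, and timestamp orders the rows inside it. The sketch below imitates that layout in memory. It is an illustration only; `toPartitions` is a hypothetical helper, short integers stand in for UUIDs, and ISO strings stand in for timeuuid values.

```javascript
// Hypothetical events, deliberately out of order.
const events = [
  { user_id: 123, timestamp: "2025-11-08T10:01", activity_type: "add_to_cart", product_id: 456 },
  { user_id: 789, timestamp: "2025-11-08T09:30", activity_type: "view_product", product_id: 456 },
  { user_id: 123, timestamp: "2025-11-08T10:00", activity_type: "view_product", product_id: 456 },
];

// Group rows by partition key, then sort each partition by the clustering column.
function toPartitions(rows) {
  const partitions = new Map();
  for (const row of rows) {
    if (!partitions.has(row.user_id)) partitions.set(row.user_id, []);
    partitions.get(row.user_id).push(row);
  }
  for (const partitionRows of partitions.values()) {
    partitionRows.sort((a, b) => a.timestamp.localeCompare(b.timestamp));
  }
  return partitions;
}

const partitions = toPartitions(events);

// Reading one user's activity touches a single partition, already in time order.
const user123 = partitions.get(123).map((r) => r.activity_type);
console.log(user123); // [ 'view_product', 'add_to_cart' ]
```

This is why a query such as "all activity for one user, in time order" is cheap with this table design, while a query that does not specify user_id would have to consult every partition.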

Architecture: The Masterless Ring

Cassandra’s architecture is its key differentiator. It is a masterless, peer-to-peer system where all nodes in the cluster are equal. Unlike a MongoDB replica set, there is no primary/secondary distinction.

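This masterless architecture makes Cassandra highly resilient. You can lose nodes in the cluster without losing availability, as long as enough replicas of each piece of data remain reachable, and you can add new nodes to the cluster to increase capacity, with throughput growing close to linearly.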
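A common way to picture a masterless ring is consistent hashing: every node owns a position on a ring, and a key is stored on the first few nodes found walking clockwise from the key's own position. The sketch below is a simplified illustration under several assumptions. It uses a 32-bit FNV-1a hash where Cassandra's default partitioner uses Murmur3, it gives each node a single token and ignores virtual nodes, and the names `hash32`, `buildRing`, `replicasFor`, and the node names are hypothetical.

```javascript
// Simple deterministic string hash (FNV-1a, 32-bit), used here only for illustration.
function hash32(text) {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

// Place each node on the ring at the position given by hashing its name.
function buildRing(nodeNames) {
  return nodeNames
    .map((name) => ({ name, token: hash32(name) }))
    .sort((a, b) => a.token - b.token);
}

// Walk clockwise from the key's position and collect `replicationFactor` distinct nodes.
function replicasFor(ring, key, replicationFactor) {
  const position = hash32(key);
  let start = ring.findIndex((node) => node.token >= position);
  if (start === -1) start = 0; // wrap around past the highest token
  const owners = [];
  for (let i = 0; i < Math.min(replicationFactor, ring.length); i++) {
    owners.push(ring[(start + i) % ring.length].name);
  }
  return owners;
}

const ring = buildRing(["node-a", "node-b", "node-c", "node-d"]);
const owners = replicasFor(ring, "user:123", 3);

// If the first replica goes down, two other nodes still hold the data.
const survivors = owners.filter((name) => name !== owners[0]);
console.log(owners, survivors.length);
```

Any node can compute `replicasFor` on its own, so any node can accept a request and forward it to the right replicas. No coordinator or master is needed to locate data.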

Data Modeling: Query-First Design

Data modeling in Cassandra is very different from relational modeling. You don’t start with an ERD and normalize it. Instead, you follow a query-first approach. You start by identifying the queries you will need to run and then design your tables to answer those queries efficiently. This often involves creating multiple, denormalized tables to support different query patterns.
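The query-first approach can be sketched with two in-memory maps standing in for two denormalized tables. This is an illustration only: the table names `activityByUser` and `activityByProduct`, the `recordActivity` helper, and the two example queries are assumptions made for this sketch, not part of any Cassandra API.

```javascript
// Two hypothetical "tables", each keyed by the partition key of one query.
const activityByUser = new Map();    // query: "what did this user do?"
const activityByProduct = new Map(); // query: "who interacted with this product?"

function append(table, key, row) {
  if (!table.has(key)) table.set(key, []);
  table.get(key).push(row);
}

// One logical event is written to both tables (denormalization).
function recordActivity(event) {
  append(activityByUser, event.user_id, event);
  append(activityByProduct, event.product_id, event);
}

recordActivity({ user_id: 123, product_id: 456, activity_type: "view_product" });
recordActivity({ user_id: 789, product_id: 456, activity_type: "add_to_cart" });

// Each query is now a single-key lookup, with no join and no full scan.
const forUser = activityByUser.get(123).length;
const forProduct = activityByProduct.get(456).length;
console.log(forUser, forProduct); // 1 2
```

The trade-off is that every event is written twice and the application is responsible for keeping the copies in step. That suits a database where writes are cheap and reads should touch as few partitions as possible.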

Use Cases

Cassandra’s strengths in scalability, availability, and write throughput make it ideal for:

5.4 Redis: The Swiss Army Knife of In-Memory Stores

Redis (Remote Dictionary Server) is an open-source, in-memory, key-value data store. While it can be used as a primary database, it is most often used as a high-performance cache, message broker, or session store.

In-Memory Performance

Redis’s key feature is that it keeps its entire dataset in memory, with optional persistence to disk. This makes it very fast, with typical read and write operations taking less than a millisecond.
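Since Redis is most often used as a cache, it helps to see the cache-aside pattern that usually surrounds it. The sketch below does not talk to Redis. It uses a small in-memory class as a stand-in for a key-value store with expiring keys, and every name in it (`TtlCache`, `getProfile`, `loadProfileFromDatabase`, the `profile:` key prefix) is hypothetical. The clock is injected so that the example behaves deterministically.

```javascript
// A tiny in-memory cache with per-key expiry, standing in for a real key-value store.
class TtlCache {
  constructor(now) {
    this.now = now; // function returning the current time in milliseconds
    this.entries = new Map();
  }
  set(key, value, ttlMs) {
    this.entries.set(key, { value, expiresAt: this.now() + ttlMs });
  }
  get(key) {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(key); // expired
      return undefined;
    }
    return entry.value;
  }
}

let clock = 0;         // fake clock in milliseconds
let databaseReads = 0; // counts trips to the (hypothetical) slow database
const cache = new TtlCache(() => clock);

function loadProfileFromDatabase(userId) {
  databaseReads += 1;
  return { userId, username: "alice" };
}

// Cache-aside: try the cache first, fall back to the database, then populate the cache.
function getProfile(userId) {
  const key = `profile:${userId}`;
  let profile = cache.get(key);
  if (profile === undefined) {
    profile = loadProfileFromDatabase(userId);
    cache.set(key, profile, 1000);
  }
  return profile;
}

getProfile(1); // miss: reads the database
getProfile(1); // hit: served from memory
clock = 1500;  // the entry has now expired
getProfile(1); // miss again
console.log(databaseReads); // 2
```

Three requests resulted in only two database reads. The expiry time bounds how stale a cached value can become, which is the usual dial for trading freshness against load on the primary database.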

Rich Data Structures

Redis is more than just a simple key-value store. It supports a rich set of data structures, including:

These data structures allow you to build complex applications directly in Redis.

Use Cases

Redis is a versatile tool that can be used for a wide variety of use cases:

5.5 Choosing the Right NoSQL Database

With so many different NoSQL databases to choose from, how do you pick the right one for your use case? Here is a simple decision framework:

Often, the answer is not to choose a single database but to use multiple databases in a polyglot persistence architecture, where you use the right database for the right job. For example, you might use PostgreSQL for your core transactional data, MongoDB for your product catalog, Cassandra for your user activity logs, and Redis for caching.

Chapter Summary

In this chapter, we have journeyed beyond the world of relational databases and into the diverse and powerful world of NoSQL. We have understood the fundamental principles that drive the NoSQL movement, including the challenges of scaling relational databases, the trade-offs of the CAP theorem, and the flexibility of the BASE consistency model. We have taken a deep dive into three of the most important open-source NoSQL databases: MongoDB, the leading document database; Apache Cassandra, the master of scale; and Redis, the Swiss army knife of in-memory stores. You should now have a solid understanding of the different types of NoSQL databases, their architectures, their data models, and their ideal use cases. You are now equipped to make informed decisions about when to use a relational database and when to reach for a NoSQL alternative.

In the next chapter, we will continue our exploration of data storage by looking at the foundation of the modern data platform: object storage and the data lake.