Chapter 5: NoSQL Systems: MongoDB, Cassandra, and Redis

Relational databases gave TuranMart a trustworthy system of record for customers, products, orders, and payments. That system is still essential, but it is no longer the only storage system the company needs. The product catalog must serve flexible product attributes that differ across categories. The mobile application emits high-volume behavioral events that analysts want to slice by customer, product, region, and time. The website needs a low-latency cache for sessions, carts, feature flags, and popular catalog fragments. If the team tries to force all of these workloads into one relational schema, it will either over-normalize fast-changing records, overload the operational database, or create slow user experiences.

This chapter introduces NoSQL systems through three production patterns: MongoDB for document-oriented product data, Apache Cassandra for query-first wide-column event storage, and Redis for in-memory cache and fast operational state. The goal is not to replace relational thinking. The goal is to add a second storage vocabulary so that a data engineer can choose the right persistence model for the access pattern, consistency requirement, latency target, and operational risk.

Opening Scenario: TuranMart’s Storage Split

TuranMart’s checkout system now runs on the relational database designed in Chapter 4. The schema protects orders and payments, but three new requirements arrive within the same quarter. The merchandising team wants category-specific product attributes such as shoe size, phone memory, fabric, warranty length, and seller badges. The analytics team wants to retain every product-view and add-to-cart event for funnel analysis. The platform team wants faster reads for homepage recommendations and user sessions during campaign traffic.

The chief data engineer refuses to choose a database by fashion. Instead, the team writes down the access patterns. Product pages usually fetch one product document by SKU and render all attributes together. Event queries usually read events by customer and time window, or aggregate product activity by day. Session reads must complete in milliseconds and can tolerate reconstruction from a durable store if the cache is lost. These observations lead to a storage split: MongoDB for flexible catalog documents, Cassandra for high-write event tables designed around known queries, and Redis for cache/session state.

Figure 1: Chapter 5 separates product catalog documents, time-series event access, and low-latency cache state while preserving the relational order database as the system of record.

The split introduces responsibility as well as flexibility. A NoSQL system can remove relational constraints, but it does not remove the need for data contracts. A document collection still needs validation and indexes. A Cassandra table still needs a query-first primary key. A Redis cache still needs expiration, invalidation, and recovery rules. In production data engineering, NoSQL means deliberate workload-specific modeling, not schema-free improvisation.

Learning Objectives

After completing this chapter, you should be able to explain when a NoSQL system is appropriate, how MongoDB, Cassandra, and Redis differ, and how to integrate them into a governed data platform.

| Objective | What you should be able to do in practice |
| --- | --- |
| Model documents | Design a MongoDB product document that embeds data read together and references data that changes independently. |
| Design query-first tables | Create Cassandra tables from query patterns rather than from entity-relationship diagrams. |
| Use cache deliberately | Choose Redis data structures, time-to-live rules, and invalidation patterns for low-latency state. |
| Reason about consistency | Explain trade-offs among consistency, availability, partition tolerance, replication, and operational recovery. |
| Select storage systems | Compare document, wide-column, key-value, cache, relational, and analytical storage based on workload evidence. |
| Build a small lab | Run or inspect deterministic MongoDB, Cassandra, and Redis assets for a TuranMart catalog, event, and cache workflow. |

5.1 Why NoSQL Exists

NoSQL systems became popular because web-scale applications exposed workloads that did not always fit a single relational design. Some records were naturally hierarchical. Some write paths produced too many events for a single primary database. Some use cases needed global distribution, tunable replication, or very low latency for simple key lookups. The phrase NoSQL is therefore best understood as a family of non-relational storage models, not as one architecture.

Definition. In this book, a NoSQL system is a workload-specific data store that relaxes some relational assumptions in order to optimize a particular access pattern, scale pattern, availability target, data shape, or latency requirement.

The word “relaxes” is important. MongoDB, Cassandra, and Redis do not eliminate structure. MongoDB documents are JSON-like BSON records that should follow application-level schema rules. Cassandra tables require careful partition-key and clustering-key design. Redis stores typed values such as strings, hashes, lists, sets, sorted sets, and streams. A professional data engineer treats these systems as different contract mechanisms, not as places where modeling discipline disappears.

| Storage model | Natural fit | Dangerous misuse |
| --- | --- | --- |
| Document database | Product catalog, customer profile, content metadata, configuration records. | Treating every document as an ungoverned dump of arbitrary fields. |
| Wide-column database | High-write event access, time-series lookups, known large-scale query patterns. | Designing normalized relational-style tables and expecting joins. |
| Key-value/cache | Sessions, idempotency keys, feature flags, counters, rate limits, hot fragments. | Using the cache as the only durable source of important business facts. |
| Relational database | Orders, payments, inventory adjustments, finance records, master data. | Forcing highly variable nested documents into many sparse tables without evidence. |

NoSQL decisions must begin with access patterns. Before selecting a product, ask what the application reads and writes, how much data is involved, how strict the consistency requirement is, what happens during a network partition, and how operators recover after failure. Popularity and cloud availability are useful filters only after those questions are answered.

5.2 Choosing a NoSQL Model

Different NoSQL systems optimize different physical and operational realities. MongoDB stores documents that can represent nested product attributes. Cassandra distributes rows across a cluster using partition keys and is designed for high availability and scalable writes when the query model is known in advance. Redis keeps frequently accessed state in memory and offers purpose-built data structures for fast operational workflows.

Figure 2: A NoSQL selection guide starts from access pattern, data shape, latency target, scale requirement, and consistency expectation rather than from product popularity.

A useful selection process has five steps. First, name the business event or entity. Second, write the exact reads and writes that the application or pipeline must perform. Third, describe the failure mode: stale reads, duplicate writes, lost cache entries, and delayed reconciliation have different business costs. Fourth, decide which system owns durable truth. Fifth, document how data moves from operational storage into the lake, warehouse, or lakehouse.

| Question | MongoDB answer | Cassandra answer | Redis answer |
| --- | --- | --- | --- |
| What is the primary data shape? | Nested documents with flexible attributes. | Rows grouped by partition and sorted by clustering keys. | Keys mapped to typed in-memory values. |
| What should be modeled first? | Document read and update boundaries. | Queries, partition cardinality, and clustering order. | Key naming, TTL, eviction, and invalidation rules. |
| What is the common read? | Fetch one product or profile with embedded attributes. | Read events for a known key and time range. | Fetch or update a small hot state object. |
| What is the common danger? | Unbounded document growth and missing indexes. | Hot partitions and unsupported ad hoc queries. | Silent stale data and accidental durable-state dependency. |
| Data engineering concern | Change capture, schema drift, and catalog governance. | Time bucketing, idempotent writes, and compaction behavior. | Observability of hit rate, memory, evictions, and rebuild path. |

The selection should also recognize organizational capabilities. A team that cannot operate a distributed database should avoid introducing Cassandra unless the scale and availability requirements justify it. A team that lacks cache observability should not place critical business logic in Redis without fallbacks. Architecture is not only about technology fit; it is also about operational maturity.

5.3 MongoDB for Flexible Product Documents

MongoDB is a document database. Its data modeling guidance emphasizes that documents should be designed around application access patterns, with related data embedded when it is commonly read together and referenced when data changes independently or would cause excessive duplication.[1] This is a different starting point from relational normalization. In MongoDB, the unit of modeling is often the aggregate document that the application wants to retrieve or update.

For TuranMart, a product page typically needs the product identity, title, category, price, seller, attributes, images, and small merchandising badges in one read. A document model can represent that shape naturally:

{
  "sku": "TM-PHONE-001",
  "title": "Samarkand X5 Smartphone",
  "category": "electronics",
  "price": { "amount": 349.00, "currency": "USD" },
  "attributes": {
    "memory_gb": 128,
    "color": "midnight blue",
    "warranty_months": 24
  },
  "seller": { "seller_id": "seller_17", "name": "Turan Mobile" },
  "status": "active",
  "updated_at": "2026-05-01T10:00:00Z"
}

Figure 3: MongoDB modeling embeds product attributes read with the product page while keeping independently changing operational facts in their owning systems.

5.3.1 Embedding and Referencing

Embedding is appropriate when child data belongs to the parent’s lifecycle and is frequently read with it. Product attributes, image metadata, and small badges are good candidates because the page renders them together. Referencing is appropriate when the related object changes independently, is shared by many documents, or grows without a predictable bound. Seller account details, inventory reservations, payments, and shipment state should usually remain in their own operational systems.

| Modeling decision | Prefer embedding when... | Prefer referencing when... |
| --- | --- | --- |
| Product attributes | Attributes are category-specific and read with the product page. | Attributes are governed centrally across many products and change independently. |
| Images | The document stores a small list of URLs and alt text. | Media assets have their own lifecycle, permissions, or processing pipeline. |
| Reviews | Only a small summary is needed on the product page. | Reviews can grow without bound and need separate moderation workflow. |
| Inventory | A cached display value is acceptable. | Quantity affects checkout correctness and must remain transactional. |
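
The reviews row can be illustrated with a minimal Python sketch. It assumes full reviews live in their own collection and that only a bounded `reviews_summary` is embedded in the product document. The review fields (`review_id`, `rating`, `created_at`) and the sample data are hypothetical and are not part of the lab contract.

```python
from statistics import mean


def build_reviews_summary(reviews, max_recent=3):
    """Return a small, bounded summary to embed in a product document."""
    if not reviews:
        return {"count": 0, "average_rating": None, "recent_review_ids": []}
    newest_first = sorted(reviews, key=lambda r: r["created_at"], reverse=True)
    return {
        "count": len(reviews),
        "average_rating": round(mean(r["rating"] for r in reviews), 2),
        # Full review text stays in its own collection; embed ids only.
        "recent_review_ids": [r["review_id"] for r in newest_first[:max_recent]],
    }


# Hypothetical reviews; timestamps share one ISO-8601 format so they sort as text.
reviews = [
    {"review_id": "r1", "rating": 5, "created_at": "2026-04-01T09:00:00Z"},
    {"review_id": "r2", "rating": 4, "created_at": "2026-04-03T09:00:00Z"},
    {"review_id": "r3", "rating": 3, "created_at": "2026-04-02T09:00:00Z"},
    {"review_id": "r4", "rating": 4, "created_at": "2026-04-05T09:00:00Z"},
]
product = {"sku": "TM-PHONE-001", "reviews_summary": build_reviews_summary(reviews)}
print(product["reviews_summary"])
```

The embedded summary has a fixed upper size regardless of how many reviews exist, which is the property that makes embedding safe here.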

Document modeling also requires index discipline. MongoDB indexes support efficient query execution, but every index has write and storage costs.[2] TuranMart should index sku for exact product lookup, category and status for catalog browsing, and perhaps selected attributes that are common filters. It should not automatically index every nested field simply because the document model allows it.

5.3.2 Schema Validation and Drift

MongoDB’s flexible document model is useful, but flexibility can become drift. A product collection that stores memory_gb, memoryGB, memory, and ram_size for the same concept will eventually harm search, analytics, and support tools. The data engineering team should publish a document contract, validate required fields, and track schema evolution just as carefully as it tracks relational migrations.

A practical pattern is to define a small required core for every product and allow category-specific extensions under attributes. Downstream pipelines can then rely on sku, category, status, price.amount, price.currency, and updated_at while specialized transformations handle category-level fields.
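
A pipeline-side check for this contract might look like the following sketch. It is written in plain Python so it can run without a database. The required paths come from the paragraph above, while the alias map (which treats `memory_gb` as the canonical name) and the sample documents are assumptions made for illustration.

```python
REQUIRED_PATHS = (
    "sku", "category", "status", "price.amount", "price.currency", "updated_at",
)

# Assumed alias map: variant spellings that should all be memory_gb.
ATTRIBUTE_ALIASES = {
    "memoryGB": "memory_gb",
    "memory": "memory_gb",
    "ram_size": "memory_gb",
}


def get_path(doc, dotted):
    """Walk a dotted path such as 'price.currency'; return None if a step is missing."""
    current = doc
    for part in dotted.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def check_product(doc):
    """Return a list of contract violations for one product document."""
    problems = [
        f"missing required field: {path}"
        for path in REQUIRED_PATHS
        if get_path(doc, path) is None
    ]
    for name in doc.get("attributes", {}):
        if name in ATTRIBUTE_ALIASES:
            problems.append(
                f"drifted attribute name: {name} (expected {ATTRIBUTE_ALIASES[name]})"
            )
    return problems


good = {
    "sku": "TM-PHONE-001", "category": "electronics", "status": "active",
    "price": {"amount": 349.00, "currency": "USD"},
    "attributes": {"memory_gb": 128}, "updated_at": "2026-05-01T10:00:00Z",
}
drifted = {
    "sku": "TM-PHONE-002", "category": "electronics", "status": "active",
    "price": {"amount": 199.00},          # currency is missing
    "attributes": {"ram_size": 64},       # drifted spelling
    "updated_at": "2026-05-01T10:00:00Z",
}
print(check_product(good))
print(check_product(drifted))
```

MongoDB can also enforce required fields on the server with a `$jsonSchema` collection validator; a client-side check like this one is a complement for downstream pipelines, not a replacement.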

5.4 Cassandra for Query-First Event Storage

Apache Cassandra is a distributed wide-column database designed for high availability and scalable writes across nodes. Its data-modeling documentation states that Cassandra modeling starts with application queries rather than with a normalized entity model.[3] That rule is not optional. CQL has no join operation, so a read cannot combine arbitrary tables at query time. Instead, the engineer creates tables whose primary keys match the reads the application must perform.

TuranMart’s event platform needs at least two access patterns. Customer support wants to inspect a customer’s recent product views and cart actions. Product managers want daily activity for a product. A relational designer might create one normalized events table and expect indexes to solve every query. A Cassandra designer creates query-specific tables.

CREATE TABLE events_by_customer_day (
  customer_id text,
  event_day date,
  event_ts timestamp,
  event_id uuid,
  event_type text,
  product_sku text,
  region text,
  PRIMARY KEY ((customer_id, event_day), event_ts, event_id)
) WITH CLUSTERING ORDER BY (event_ts DESC, event_id ASC);

CREATE TABLE events_by_product_day (
  product_sku text,
  event_day date,
  event_ts timestamp,
  event_id uuid,
  event_type text,
  customer_id text,
  region text,
  PRIMARY KEY ((product_sku, event_day), event_ts, event_id)
) WITH CLUSTERING ORDER BY (event_ts DESC, event_id ASC);

Figure 4: Cassandra design duplicates event facts into query-specific tables so reads can be served by partition key and clustering order rather than by joins.

5.4.1 Partition Keys and Clustering Keys

The partition key determines where data is placed in the cluster. The clustering keys determine the sort order within a partition. A good Cassandra table balances two forces: partitions should be narrow enough to avoid hotspots and wide enough to support efficient range reads. For event data, a common pattern is to combine an entity identifier with a time bucket, such as (customer_id, event_day) or (product_sku, event_day).

| Key design element | TuranMart example | Engineering purpose |
| --- | --- | --- |
| Partition key | (customer_id, event_day) | Keeps one customer’s daily events together without creating an unlimited partition. |
| Clustering key | event_ts DESC, event_id ASC | Supports recent-first timeline reads and deterministic ordering. |
| Duplicate table | events_by_product_day | Serves product analytics without scanning customer partitions. |
| Time bucket | event_day | Controls partition size and supports retention policies. |
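
The effect of these keys can be imitated in memory. The sketch below is an illustration in plain Python, not Cassandra client code: it groups events by an `(entity, event_day)` partition key and orders each partition by `event_ts DESC, event_id ASC`. Event ids are short strings instead of UUIDs for readability, and the sample events and the SKU `TM-SHOE-001` are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timezone


def partition_key(event, entity_field):
    """Build the (entity, event_day) partition key, bucketing by UTC day."""
    ts = datetime.fromisoformat(event["event_ts"])
    return (event[entity_field], ts.astimezone(timezone.utc).date().isoformat())


def project(events, entity_field):
    """Group events into partitions ordered by event_ts DESC, event_id ASC."""
    partitions = defaultdict(list)
    for event in events:
        partitions[partition_key(event, entity_field)].append(event)
    for rows in partitions.values():
        # Two stable sorts: tie-breaker first, then the primary order.
        rows.sort(key=lambda e: e["event_id"])
        rows.sort(key=lambda e: datetime.fromisoformat(e["event_ts"]), reverse=True)
    return dict(partitions)


events = [
    {"event_id": "e1", "customer_id": "c1", "product_sku": "TM-PHONE-001",
     "event_ts": "2026-05-01T10:00:00+00:00", "event_type": "view"},
    {"event_id": "e2", "customer_id": "c1", "product_sku": "TM-SHOE-001",
     "event_ts": "2026-05-01T10:05:00+00:00", "event_type": "add_to_cart"},
    {"event_id": "e3", "customer_id": "c2", "product_sku": "TM-PHONE-001",
     "event_ts": "2026-05-02T08:00:00+00:00", "event_type": "view"},
]

by_customer = project(events, "customer_id")   # like events_by_customer_day
by_product = project(events, "product_sku")    # like events_by_product_day
print([e["event_id"] for e in by_customer[("c1", "2026-05-01")]])
print(sorted(by_product))
```

The same three events land in two customer partitions but three product partitions, which shows why each table has to be sized and monitored separately.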

The cost is duplication. The same logical event may be written to multiple Cassandra tables. That is acceptable only if ingestion is idempotent, event identifiers are stable, and reconciliation jobs can detect missing or inconsistent rows. Cassandra rewards explicit query design and punishes vague “store now, query later” thinking.
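
The two conditions named above, idempotent ingestion and reconciliation, can be sketched with dictionaries standing in for the two tables. Keying each row by its full primary key mirrors the way a Cassandra `INSERT` overwrites an existing row with the same key. The helper names and sample events are hypothetical.

```python
def upsert(table, row, entity_field):
    """Write a row under its full primary key; a replay overwrites, never duplicates."""
    key = (row[entity_field], row["event_day"], row["event_ts"], row["event_id"])
    table[key] = row


def missing_event_ids(source_table, target_table):
    """Return event ids present in source_table but absent from target_table."""
    source_ids = {key[3] for key in source_table}
    target_ids = {key[3] for key in target_table}
    return sorted(source_ids - target_ids)


by_customer, by_product = {}, {}

first = {"event_id": "e1", "customer_id": "c1", "product_sku": "TM-PHONE-001",
         "event_day": "2026-05-01", "event_ts": "2026-05-01T10:00:00+00:00"}
second = {"event_id": "e2", "customer_id": "c1", "product_sku": "TM-PHONE-001",
          "event_day": "2026-05-01", "event_ts": "2026-05-01T10:05:00+00:00"}

# A retried delivery writes the same event twice; the table still holds one row.
for _ in range(2):
    upsert(by_customer, first, "customer_id")
upsert(by_product, first, "product_sku")

# Simulate a partial failure: the second event reaches only one projection.
upsert(by_customer, second, "customer_id")

print(len(by_customer))
print(missing_event_ids(by_customer, by_product))
```

A reconciliation job built on this idea would replay the missing ids from the durable ingestion path rather than copy rows between projections.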

5.4.2 Consistency and Availability

Cassandra exposes tunable consistency. In practice, a team chooses read and write consistency levels based on latency, availability, and correctness needs. A product-view event can usually tolerate eventual consistency because a delayed analytics count is not the same as a lost payment. A checkout payment should not be moved from the relational database into Cassandra merely because Cassandra writes scale well.
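
One widely used rule of thumb for tunable consistency is that a read is guaranteed to see the latest acknowledged write when the number of replicas that acknowledge the write plus the number consulted on the read exceeds the replication factor. The short sketch below works through that arithmetic; the function names are illustrative and the replication factor of 3 is an assumed example, not a TuranMart setting.

```python
def quorum(replication_factor):
    """Number of replicas in a majority quorum."""
    return replication_factor // 2 + 1


def replicas_overlap(replication_factor, write_replicas, read_replicas):
    """True when every read set must intersect every acknowledged write set."""
    return write_replicas + read_replicas > replication_factor


rf = 3  # assumed replication factor for the example
print(quorum(rf))                                      # 2
print(replicas_overlap(rf, quorum(rf), quorum(rf)))    # True: quorum writes and reads
print(replicas_overlap(rf, 1, 1))                      # False: single-replica writes and reads
```

For product-view events, the faster non-overlapping setting may be acceptable; the point of the calculation is to make that choice deliberately.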

The CAP theorem is often summarized too casually. Brewer’s CAP framing explains why a distributed system under network partition must make trade-offs between availability and consistency.[4] The practical lesson for data engineers is not to memorize slogans such as “CP” or “AP.” The lesson is to ask which stale, delayed, duplicate, or unavailable behavior the business can tolerate for each workload.

5.5 Redis for Cache, Sessions, and Fast State

Redis is an in-memory data structure server commonly used for caching, session storage, queues, counters, leaderboards, rate limits, and short-lived coordination state. Its documentation presents Redis data types such as strings, hashes, lists, sets, sorted sets, streams, and probabilistic structures as first-class modeling tools rather than as one generic blob store.[5]

For TuranMart, Redis can improve latency without becoming the system of record. Product fragments can be cached after MongoDB reads. Session state can be stored with a TTL. A cart preview can be rebuilt from the durable checkout system if needed. Campaign counters can be kept in Redis and periodically written to analytics storage.

Figure 5: Redis cache-aside keeps hot reads fast while preserving MongoDB, Cassandra, and PostgreSQL as durable sources of truth for their respective workloads.

5.5.1 Data Structures and TTLs

Redis design begins with key names, value types, expiration rules, and rebuild paths. A session might use a hash at session:{session_id} with a 30-minute TTL. A product fragment might use a string key such as product:summary:{sku} with a short TTL. A campaign leaderboard might use a sorted set. A stream might buffer lightweight events before a consumer drains them.

| Use case | Redis type | Key example | Expiration rule |
| --- | --- | --- | --- |
| User session | Hash | session:sess_001 | 30 minutes after last activity. |
| Product card cache | String or JSON payload | product:summary:TM-PHONE-001 | 5 minutes or invalidated on catalog update. |
| Rate limit | String counter | rate:user_42:login | Short fixed window such as 60 seconds. |
| Campaign ranking | Sorted set | campaign:summer:views | Campaign lifetime plus audit buffer. |
| Event handoff | Stream | events:clickstream | Trim by length or time after durable ingestion. |
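
The rate-limit row is a good example of a key whose expiration rule is part of its meaning. The sketch below imitates a fixed-window counter in plain Python, with a dictionary standing in for Redis; in a real deployment the same idea is typically expressed with `INCR` on the key and `EXPIRE` when the window opens. The class name, the limit of 3, and the injectable clock are assumptions made so the example is deterministic.

```python
class FixedWindowLimiter:
    """In-memory stand-in for a Redis counter key with a TTL."""

    def __init__(self, limit, window_seconds, clock):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self._counters = {}  # key -> (count, expires_at)

    def allow(self, user_id, action):
        """Count one attempt and report whether it is within the limit."""
        key = f"rate:{user_id}:{action}"
        now = self.clock()
        count, expires_at = self._counters.get(key, (0, now + self.window))
        if now >= expires_at:
            # The previous window expired, so start a fresh one.
            count, expires_at = 0, now + self.window
        count += 1
        self._counters[key] = (count, expires_at)
        return count <= self.limit


now = [0.0]  # fake clock, advanced by hand
limiter = FixedWindowLimiter(limit=3, window_seconds=60, clock=lambda: now[0])

results = [limiter.allow("user_42", "login") for _ in range(4)]
print(results)          # the fourth attempt in the window is rejected

now[0] = 60.0           # the window has elapsed
print(limiter.allow("user_42", "login"))
```

Losing this key only resets a counter, which is why a rate limit is a reasonable thing to keep in volatile state.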

Redis persistence options can write data to disk through point-in-time snapshots, append-only files, or a combination of both.[6] Persistence improves recovery, but it does not turn every cache design into a safe system of record. The team must still define what happens if Redis loses a key, restarts, evicts memory, or serves stale data.

5.5.2 Cache-Aside and Invalidation

The most common pattern in application data platforms is cache-aside. The application checks Redis first. On a miss, it reads from the durable store, writes a cached representation with a TTL, and returns the result. On update, the application either deletes the cache key or writes a new value. This pattern is simple, but it requires discipline. If the update path forgets to invalidate the cache, users may see stale catalog data even though the database is correct.
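
The read and update paths described above can be sketched without a running Redis instance. In this illustration a small `TTLCache` class stands in for Redis `GET`, `SET` with an expiry, and `DEL`, and a dictionary stands in for the durable MongoDB collection. The class, the function names, the 300-second TTL, and the price change are hypothetical.

```python
import json


class TTLCache:
    """Tiny in-memory stand-in for a cache with per-key expiry."""

    def __init__(self, clock):
        self._clock = clock
        self._data = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired entries behave like misses
            return None
        return value

    def set(self, key, value, ttl_seconds):
        self._data[key] = (value, self._clock() + ttl_seconds)

    def delete(self, key):
        self._data.pop(key, None)


def get_product_summary(sku, cache, store, ttl_seconds=300):
    """Cache-aside read: try the cache, fall back to the durable store, then fill."""
    key = f"product:summary:{sku}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached), "hit"
    doc = store[sku]                          # durable read
    cache.set(key, json.dumps(doc), ttl_seconds)
    return doc, "miss"


def update_product(sku, changes, cache, store):
    """Write the durable store first, then invalidate the cached copy."""
    store[sku] = {**store[sku], **changes}
    cache.delete(f"product:summary:{sku}")


now = [0.0]  # fake clock so the example is deterministic
cache = TTLCache(clock=lambda: now[0])
store = {"TM-PHONE-001": {"sku": "TM-PHONE-001", "price": 349.0}}

print(get_product_summary("TM-PHONE-001", cache, store)[1])   # miss
print(get_product_summary("TM-PHONE-001", cache, store)[1])   # hit
update_product("TM-PHONE-001", {"price": 329.0}, cache, store)
doc, outcome = get_product_summary("TM-PHONE-001", cache, store)
print(outcome, doc["price"])                                  # miss 329.0
```

If `update_product` skipped the `delete` call, the third read would be a hit that returns the old price until the TTL expired, which is exactly the stale-catalog failure the paragraph warns about.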

Data engineers care about cache behavior because cache misses, stale values, and evictions appear downstream as latency spikes, duplicated events, or confusing analytics. Production teams should monitor hit rate, memory usage, eviction count, key cardinality, and rebuild latency.
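
A monitoring check over these signals can be very small. The sketch below assumes a dictionary of counters shaped like fields reported by the Redis `INFO` command (`keyspace_hits`, `keyspace_misses`, `evicted_keys`, `used_memory`, `maxmemory`). The thresholds and sample numbers are illustrative assumptions, not recommended values.

```python
def hit_rate(hits, misses):
    """Fraction of lookups served from cache; None when there is no traffic yet."""
    total = hits + misses
    return None if total == 0 else hits / total


def cache_alerts(stats, min_hit_rate=0.8, max_memory_fraction=0.9):
    """Return human-readable alerts for one snapshot of cache counters."""
    alerts = []
    rate = hit_rate(stats["keyspace_hits"], stats["keyspace_misses"])
    if rate is not None and rate < min_hit_rate:
        alerts.append(f"low hit rate: {rate:.2f}")
    if stats["evicted_keys"] > 0:
        alerts.append(f"evictions observed: {stats['evicted_keys']}")
    # maxmemory of 0 means no limit is configured, so skip the ratio check.
    if stats["maxmemory"] > 0:
        if stats["used_memory"] / stats["maxmemory"] > max_memory_fraction:
            alerts.append("memory above threshold")
    return alerts


snapshot = {
    "keyspace_hits": 700,
    "keyspace_misses": 300,
    "evicted_keys": 12,
    "used_memory": 950,
    "maxmemory": 1000,
}
for alert in cache_alerts(snapshot):
    print(alert)
```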

5.6 Consistency, Replication, and Failure Thinking

NoSQL systems often turn trade-offs that a relational database hides behind a single default configuration into explicit engineering choices. MongoDB replica sets replicate data for high availability. Cassandra distributes data across nodes and supports tunable consistency. Redis can replicate, persist, and cluster, but many deployments use it primarily for low-latency volatile state. Each choice changes the meaning of “write succeeded,” “read is current,” and “data is recoverable.”

| Failure question | MongoDB catalog | Cassandra events | Redis cache |
| --- | --- | --- | --- |
| What is the durable truth? | Product document collection, backed by operational governance. | Event log/table set with stable event IDs and replay strategy. | Usually another store; Redis holds hot derived state. |
| What stale read is tolerable? | Short catalog delay may be acceptable; price and availability may need stronger controls. | Delayed event visibility is usually acceptable for analytics. | Stale cache should be bounded by TTL and invalidation rules. |
| What duplicate is tolerable? | Duplicate SKU is not acceptable; enforce uniqueness. | Duplicate events are common enough to require idempotent identifiers. | Duplicate cache set is usually harmless; duplicate side effects are not. |
| What recovery path is required? | Backup, restore, and change stream/reload plan. | Replay from source events and repair/reconciliation jobs. | Rebuild from durable stores and warm critical keys. |

A strong data platform therefore records ownership. PostgreSQL owns orders and payments. MongoDB owns flexible catalog documents. Cassandra owns query-optimized event projections, preferably fed from a durable event ingestion path. Redis owns hot derived state that can be rebuilt. When ownership is unclear, incidents become political arguments rather than engineering recoveries.

5.7 NoSQL in the Data Engineering Platform

NoSQL systems rarely live alone. They feed and are fed by pipelines. MongoDB product changes may be captured into an object store or lakehouse for analytics. Cassandra event tables may be populated from Kafka topics, and their contents exported as daily Parquet files. Redis counters may be periodically flushed into durable storage for reporting. These flows create governance requirements: schema contracts, lineage, access controls, retention policies, and quality checks.

Figure 6: NoSQL systems become data-engineering assets when their operational roles, ingestion paths, analytical exports, quality checks, and ownership boundaries are explicit.

A practical governance pattern is to publish a storage decision record for every new database. The record should name the workload, owner, access patterns, data contract, retention rule, backup or rebuild path, monitoring signals, and downstream consumers. This discipline prevents the common anti-pattern where a team introduces a NoSQL system for speed and leaves future data engineers to reverse-engineer what the data means.

| Governance field | Example for TuranMart |
| --- | --- |
| Workload | Flexible product catalog documents for the web and mobile storefront. |
| Owner | Catalog platform team, with data engineering as downstream steward. |
| Contract | Required fields: sku, title, category, status, price, updated_at; category attributes under attributes. |
| Downstream flow | Daily product snapshot to object storage and incremental changes to analytics models. |
| Retention | Keep active and recently inactive product versions; archive long-term history in the lake. |
| Recovery | Restore MongoDB backup, replay approved catalog changes, rebuild Redis product cache. |
| Monitoring | Index use, slow queries, document growth, cache hit rate, Cassandra partition size, ingestion lag. |
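
A storage decision record is easier to enforce when it is machine-checkable. One possible shape, sketched here as a Python dataclass, lists the fields named in this section and reports any that were left empty. The class name, field names, and the `missing_fields` helper are assumptions; the example values are abbreviated from the TuranMart table, with the recovery path deliberately left blank.

```python
from dataclasses import dataclass, fields


@dataclass
class StorageDecisionRecord:
    workload: str
    owner: str
    access_patterns: list
    data_contract: str
    retention_rule: str
    recovery_path: str
    monitoring_signals: list
    downstream_consumers: list

    def missing_fields(self):
        """Names of fields left empty, so incomplete records can be rejected in review."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


catalog_record = StorageDecisionRecord(
    workload="Flexible product catalog documents for the web and mobile storefront",
    owner="Catalog platform team",
    access_patterns=["Fetch one product document by SKU"],
    data_contract="sku, title, category, status, price, updated_at",
    retention_rule="Keep active and recently inactive product versions",
    recovery_path="",  # left blank on purpose to show the check
    monitoring_signals=["index use", "slow queries", "document growth"],
    downstream_consumers=["Daily product snapshot to object storage"],
)
print(catalog_record.missing_fields())  # ['recovery_path']
```

A check like this could run in code review or CI so that a new store cannot be approved without a stated recovery path.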

5.8 Guided Lab: Product Catalog, Event Tables, and Cache Layer

The guided lab for this chapter is stored in shared/labs/ch05_nosql_systems. It gives you deterministic starter assets for three related storage tasks: a MongoDB product catalog, Cassandra query-first event tables, and Redis cache/session keys. The lab is intentionally small enough to inspect without running heavyweight services, but it also includes a Docker Compose file for teams that want hands-on local experimentation.

5.8.1 Lab Goal

By the end of the lab, you will have mapped TuranMart’s storage requirements to concrete NoSQL models. You will load or inspect product documents, create Cassandra tables for two event access patterns, define Redis cache keys with TTLs, and validate that expected outputs match the starter data.

5.8.2 Lab Files

| Path | Purpose |
| --- | --- |
| shared/labs/ch05_nosql_systems/docker-compose.yml | Starts MongoDB, Cassandra, and Redis for local experimentation. |
| shared/labs/ch05_nosql_systems/data/products.jsonl | Deterministic product catalog documents used by MongoDB examples. |
| shared/labs/ch05_nosql_systems/data/events.csv | Deterministic product-view and cart events used by Cassandra modeling exercises. |
| shared/labs/ch05_nosql_systems/mongo/catalog_setup.js | MongoDB collection setup, validation, indexes, and seed data. |
| shared/labs/ch05_nosql_systems/cassandra/turanmart_events.cql | Cassandra keyspace and query-specific event tables. |
| shared/labs/ch05_nosql_systems/redis/session_cache.redis | Redis cache/session commands with TTL examples. |
| shared/labs/ch05_nosql_systems/expected_output/product_activity_by_day.csv | Expected event aggregation used by the validator. |
| shared/labs/ch05_nosql_systems/tests/validate_lab_assets.py | Lightweight validation script for files, model choices, and deterministic outputs. |
| shared/solutions/ch05_nosql_systems/solution.md | Reference solution and interpretation notes. |

5.8.3 Quick Start

From the repository root, first validate the lab package without running services:

python shared/labs/ch05_nosql_systems/tests/validate_lab_assets.py

If Docker is available and you want to experiment with the services, start them with:

docker compose -f shared/labs/ch05_nosql_systems/docker-compose.yml up -d

Then inspect the three workload models. The MongoDB script defines the product document contract and indexes. The Cassandra CQL file creates query-specific tables for events by customer and by product. The Redis command file demonstrates session hashes, product summary caching, TTLs, and counters.

5.8.4 Expected Output

The validator should print a deterministic success summary similar to the following:

Chapter 5 lab validation passed.
Products: 5
Events: 12
Product-day rows: 7
Redis TTL examples: 3
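
A product-day count of this kind is presumably produced by grouping events by product and calendar day. The sketch below shows that grouping on a small inline sample. It does not read the lab files: the column names and rows are hypothetical and may differ from the real `events.csv`, and taking the first ten characters of the timestamp as the day assumes UTC ISO-8601 timestamps.

```python
import csv
import io
from collections import Counter

# Hypothetical sample; not the lab's actual data.
SAMPLE = """event_id,event_ts,product_sku,event_type
e1,2026-05-01T10:00:00+00:00,TM-PHONE-001,view
e2,2026-05-01T11:00:00+00:00,TM-PHONE-001,add_to_cart
e3,2026-05-02T09:00:00+00:00,TM-PHONE-001,view
e4,2026-05-01T12:00:00+00:00,TM-SHOE-001,view
"""


def product_activity_by_day(csv_text):
    """Count events per (product_sku, event_day) and return sorted rows."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        event_day = row["event_ts"][:10]  # date prefix of the ISO-8601 timestamp
        counts[(row["product_sku"], event_day)] += 1
    return sorted((sku, day, n) for (sku, day), n in counts.items())


for sku, day, n in product_activity_by_day(SAMPLE):
    print(sku, day, n)
```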

5.8.5 Completion Checklist

| Check | Expected result |
| --- | --- |
| Product documents inspected | Required fields and category-specific attributes are clear. |
| MongoDB indexes reviewed | sku is unique, and common browsing filters are indexed deliberately. |
| Cassandra tables understood | Two query-specific tables serve customer-day and product-day reads. |
| Redis keys reviewed | Session, product summary, rate-limit, and counter examples use explicit TTLs where appropriate. |
| Expected output validated | Product-day event counts match expected_output/product_activity_by_day.csv. |
| Trade-off documented | Your notes state which system owns durable truth and which values can be rebuilt. |

5.8.6 Cleanup

When you finish a Docker-based run, stop the services and remove local runtime state:

docker compose -f shared/labs/ch05_nosql_systems/docker-compose.yml down -v

Common Pitfalls

The first pitfall is treating NoSQL as an excuse to avoid modeling. Flexible documents still need contracts. Cassandra still needs query-specific tables. Redis still needs key design. The absence of a relational schema does not remove responsibility; it moves responsibility into application contracts, validation rules, and operational practice.

The second pitfall is using one database for every workload. MongoDB can store events, Cassandra can store profile-like rows, and Redis can hold JSON strings, but capability does not equal suitability. The right question is not “Can this database store the data?” but “Can this database serve the access pattern safely and operably?”

The third pitfall is ignoring rebuild paths. A cache without a rebuild path becomes a hidden system of record. A duplicated Cassandra projection without replay becomes a reconciliation problem. A document collection without export becomes an analytical blind spot. Every NoSQL design should include the phrase “If this store fails, we recover by...”

Mini-Capstone: Storage Decision Record for TuranMart

Create a one-page storage decision record for TuranMart’s catalog, event, and cache layer. Your record should include the workload, selected system, access patterns, data contract, consistency expectations, monitoring signals, and recovery path.

| Requirement | Acceptance criterion |
| --- | --- |
| Workload mapping | Catalog, events, and cache/session state are mapped to MongoDB, Cassandra, and Redis with justification. |
| Access patterns | At least two reads and one write are documented for each system. |
| Consistency statement | The record states which stale reads, duplicate writes, or missed updates are tolerable. |
| Recovery path | Each system has a backup, replay, or rebuild plan. |
| Downstream integration | The record explains how data enters the analytical platform. |

Exercises

  1. Extend the MongoDB product model with a reviews_summary field. Decide whether full reviews should be embedded or referenced, and justify your choice.

  2. Add a Cassandra table for events_by_region_day. Choose a partition key and clustering keys, then explain how you would prevent a hot partition during a national campaign.

  3. Design Redis keys for a cart preview. Include key names, data structures, TTLs, and invalidation rules after checkout.

  4. Write a data quality check that detects product documents missing price.currency or events with an unknown product_sku.

  5. Compare a relational and document model for product attributes. Identify one query that becomes easier in MongoDB and one governance task that becomes harder.

Review Questions

  1. Why is “schema-free” a misleading description of production document databases?

  2. What does query-first modeling mean in Cassandra, and why is it different from relational normalization?

  3. When should Redis be treated as a cache rather than a durable system of record?

  4. How do partition keys and time buckets help control Cassandra event-table design?

  5. What should a data engineer document before introducing a new NoSQL store into a production platform?

Summary

NoSQL systems are not a replacement for relational databases. They are a set of specialized storage models that solve different workload problems. MongoDB helps when records are naturally document-shaped and read as aggregates. Cassandra helps when high-volume distributed writes must support known query patterns. Redis helps when hot operational state must be served with very low latency and can be expired, invalidated, or rebuilt.

The professional habit is to model the workload first. Define the access pattern, consistency expectation, ownership boundary, recovery path, and downstream integration before choosing the technology. When NoSQL systems are introduced with that discipline, they become a powerful part of the data engineering platform rather than a source of unmanaged complexity.

References

Footnotes