Chapter 6: Object Storage and Data Lakes with MinIO and S3 APIs

Opening Scenario: TuranMart’s Lake Before the Lakehouse

TuranMart’s storage platform has grown from a well-normalized order database in Chapter 4 into the polyglot system introduced in Chapter 5. PostgreSQL still owns payments and orders, MongoDB stores flexible product content, Cassandra keeps high-volume events, and Redis accelerates hot reads. The next problem is not where one transaction should live. The next problem is how the company should keep all analytical evidence—raw exports, application logs, clickstream events, catalog snapshots, model features, support transcripts, and finance extracts—without forcing every source into one database design.

The analytics team wants a place where every operational system can land data cheaply and durably. Data scientists want raw history so they can reprocess experiments when feature logic changes. Compliance officers want retention rules, encryption, access boundaries, and auditability. Platform engineers want storage that survives growth without making them resize disks every quarter. This is the moment when object storage becomes the foundation and the data lake becomes the pattern.

Chapter promise: By the end of this chapter, you will design and validate a Bronze-Silver-Gold data lake on top of S3-compatible object storage. You will understand object keys, prefixes, file formats, lifecycle controls, MinIO local development, and the governance decisions that keep a lake from becoming a swamp.

Figure 1: TuranMart’s object-storage data lake uses S3-compatible APIs as a durable storage substrate while compute engines, catalogs, governance controls, and consumers evolve independently.

Learning Objectives

By the end of this chapter, you should be able to design and explain an object-storage-backed data lake, build a small reproducible lake layout, and defend the governance choices that make the lake usable over time.

| Objective | What you should be able to do | Evidence in this chapter |
| --- | --- | --- |
| Explain object storage | Compare object, file, and block storage using access model, metadata, scale, and latency. | Storage model comparison and design tables. |
| Design object keys and prefixes | Create a predictable lake layout for raw, cleaned, and business-ready zones. | TuranMart path conventions and guided lab. |
| Build with S3-compatible APIs | Use MinIO locally as an S3-compatible development target and map the same concepts to Amazon S3 or Alibaba Cloud OSS. | MinIO/S3 architecture section and Docker Compose lab. |
| Validate lake quality layers | Transform raw JSON/CSV inputs into cleaned Parquet and business aggregates. | Bronze-Silver-Gold guided lab. |
| Govern the lake | Apply catalog, access, encryption, quality, and lifecycle controls before the lake becomes a swamp. | Governance control model and pitfalls. |
| Troubleshoot lake operations | Diagnose small-file problems, partition mistakes, schema drift, and misplaced permissions. | Common pitfalls, review questions, and lab checklist. |

6.1 Why Object Storage Became the Lake Foundation

Object storage stores data as independent objects inside buckets. Each object contains the bytes of the file, metadata, and a unique object key. Unlike a file system, the storage system does not maintain a true directory tree; folder-like paths such as bronze/orders/ingest_date=2026-05-30/orders.jsonl are key prefixes that humans and tools interpret as hierarchy. Alibaba Cloud OSS describes the same model directly: objects live in buckets, each object has a key, metadata, and data, and the object namespace is flat rather than a traditional file hierarchy.[1]

This model fits data engineering because analytical storage grows in size, source diversity, and retention complexity. Amazon S3 Standard is designed to store objects redundantly across at least three Availability Zones and is designed for 99.999999999% durability over a year.[2] Alibaba Cloud OSS documents twelve-nines durability, strong consistency for object operations, platform-independent APIs, storage classes, and object-level features such as versioning, bucket policies, encryption, replication, and lifecycle rules.[1] These properties make object storage a practical landing zone for data that is too large, too varied, or too long-lived for an operational database.

Definition: A data lake is a centralized repository that stores structured, semi-structured, and unstructured data at scale, usually in raw form first, then applies schema and interpretation when compute engines read the data. Alibaba Cloud’s data lake guidance highlights this schema-on-read pattern and the decoupling of compute from storage.[3]

| Term | Practical meaning | Why it matters to a data engineer |
| --- | --- | --- |
| Bucket | Top-level container for objects. | Defines broad ownership, region, policy, encryption, lifecycle, and replication boundaries. |
| Object key | Unique path-like name inside a bucket. | Drives organization, partition pruning, naming conventions, and operational discoverability. |
| Prefix | Leading portion of the object key, often shown like a folder. | Enables grouping by zone, domain, source, table, and date, even though the namespace is flat. |
| Metadata | System and user-defined attributes attached to an object. | Supports governance, lineage, content type, retention, and automation. |
| Storage class | Cost/performance tier for objects. | Aligns hot, warm, archive, and compliance data with different cost profiles. |
| Lifecycle rule | Automated transition or expiration policy. | Keeps old raw data from silently becoming a permanent cost center. |
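
The flat namespace is easier to see in a toy model. The sketch below is an illustration only, not a real object-store client: it treats a bucket as a plain dictionary of keys and shows how a listing with a prefix and a delimiter produces the folder-like view that consoles and tools display. The function name `list_prefix` and the sample keys are assumptions made for this example.

```python
from typing import Dict, List, Tuple

# A toy in-memory "bucket": a flat mapping from object key to bytes.
# There are no directories, only keys that happen to contain "/".
bucket: Dict[str, bytes] = {
    "bronze/orders/ingest_date=2026-05-30/orders.jsonl": b"{}\n",
    "bronze/orders/ingest_date=2026-05-31/orders.jsonl": b"{}\n",
    "bronze/products/ingest_date=2026-05-30/products.csv": b"id\n",
    "silver/orders_clean/order_date=2026-05-30/part-0000.parquet": b"PAR1",
}


def list_prefix(
    store: Dict[str, bytes], prefix: str, delimiter: str = "/"
) -> Tuple[List[str], List[str]]:
    """Return (keys, common_prefixes) found directly under `prefix`.

    Keys with a further delimiter after the prefix are rolled up into a
    single "common prefix", which is what makes a flat namespace look
    like a directory tree.
    """
    keys: List[str] = []
    common = set()
    for key in sorted(store):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter in rest:
            common.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            keys.append(key)
    return keys, sorted(common)


if __name__ == "__main__":
    print(list_prefix(bucket, "bronze/"))
```

Listing `bronze/` returns no objects and two common prefixes, even though nothing called a "folder" was ever created. The hierarchy exists only in how the keys are read.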

Object storage is not a magic replacement for every storage system. It is excellent for large immutable files, append-oriented analytical data, backups, media, logs, model artifacts, and lakehouse table storage. It is usually a poor fit for low-latency row updates, POSIX-heavy applications, and workloads that require millions of tiny random writes per second. The important design move is to treat object storage as the durable analytical substrate, not as a direct clone of a local disk.

Figure 2: Object, file, and block storage solve different engineering problems; a data lake normally begins with object storage because it scales through APIs and decouples storage from compute.

6.2 Object, File, and Block Storage in Practice

The easiest way to misuse object storage is to assume it behaves like a mounted folder. File storage provides shared hierarchical directories through protocols such as NFS or SMB. Block storage exposes raw volumes to operating systems and databases, which then manage file systems and random I/O on top. Object storage exposes an HTTP API for whole-object operations such as PUT, GET, DELETE, listing, versioning, and multipart upload. This difference changes how you design applications and pipelines.

| Design question | Object storage | File storage | Block storage |
| --- | --- | --- | --- |
| What is stored? | Objects addressed by key. | Files and directories. | Raw blocks on volumes. |
| Common access style | HTTP/S3-compatible APIs and SDKs. | NFS, SMB, POSIX-like interfaces. | OS-level disk and database I/O. |
| Scaling pattern | Add capacity behind the service; clients keep using APIs. | Scale file servers or distributed file systems. | Resize or add volumes attached to compute. |
| Typical latency profile | High throughput, higher per-operation latency. | Lower metadata/file-operation latency. | Lowest latency and high IOPS for attached workloads. |
| Best fit | Data lakes, archives, media, backups, ML artifacts, lakehouse tables. | Shared directories, home folders, legacy analytics. | Databases, transaction logs, virtual machine disks. |
| Design caution | Avoid tiny-file explosions and local-file assumptions. | Watch metadata bottlenecks and namespace limits. | Avoid treating it as shared analytical storage. |

Object storage has also become a portable interface, not just a product category. Amazon S3 popularized the API shape, and many storage platforms now expose S3-compatible operations. MinIO documents support for S3-compatible object and bucket APIs, including object operations, multipart uploads, lifecycle, notifications, policies, and replication, while also documenting where behavior differs from Amazon S3.[4] This makes MinIO especially useful for local development, private-cloud deployments, and teaching labs because the same conceptual model transfers to managed cloud object stores.

6.3 S3-Compatible APIs and MinIO for Local Development

MinIO gives TuranMart’s engineers a local S3-compatible target that can run in Docker during development. A developer can create a bucket, upload objects with familiar SDKs, validate prefix conventions, and test lake transformations without provisioning a cloud account. Production may use Amazon S3, Alibaba Cloud OSS, Azure Blob Storage, Google Cloud Storage, or a self-managed MinIO cluster, but the lab interface remains intentionally close to S3.

MinIO’s operational documentation describes deployments as one or more minio server nodes acting as a single object storage repository. It can run on bare metal, virtual machines, Docker, Podman, or Kubernetes, and can be deployed as a single-node development server or a distributed multi-node cluster.[5] That does not mean a laptop container has production durability. It means the laptop can teach API semantics, object layout, and pipeline behavior before cloud-specific configuration enters the picture.

Figure 3: MinIO lets developers practice S3-compatible bucket, object, multipart upload, policy, and lifecycle concepts locally before deploying the same lake pattern to managed cloud object storage.

| Platform | Deployment model | Strong use case | Design caution |
| --- | --- | --- | --- |
| Amazon S3 | Managed AWS object storage. | Mature ecosystem, broad SDK support, durable cloud-native lakes. | Request, retrieval, cross-region, and small-file costs need modeling. |
| Alibaba Cloud OSS | Managed Alibaba Cloud object storage. | OSS-based lake with MaxCompute, EMR, PAI, Spark, and OSS-HDFS integration.[3] | Region, endpoint, RAM policy, and data-transfer design should be explicit. |
| MinIO | Self-managed S3-compatible object storage. | Local labs, private cloud, hybrid environments, and portable development. | The operator owns capacity, redundancy, upgrades, monitoring, and recovery. |
| Azure Blob Storage | Managed Azure object storage. | Microsoft-centered analytics and identity environments. | APIs and governance patterns differ from pure S3 compatibility. |
| Google Cloud Storage | Managed Google Cloud object storage. | GCP analytics, AI, and BigQuery-centered systems. | Portability still requires abstracted configuration and open formats. |

A good platform team writes lake code so that environment-specific values are configuration, not logic. Bucket names, endpoints, access keys, regions, SSL settings, and prefix roots should be injected through environment variables or secrets. Data layout, file contracts, quality checks, and table semantics should remain stable across environments.
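
One way to keep environment values out of pipeline logic is to read them once into a small immutable configuration object. The sketch below is a minimal illustration under stated assumptions: the class name `LakeConfig` and the environment variable names (`LAKE_BUCKET`, `LAKE_ENDPOINT_URL`, and so on) are hypothetical and do not come from the lab. Access keys are deliberately left out; in practice they would come from a secrets manager or the SDK's own credential chain.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class LakeConfig:
    """Environment-specific settings; pipeline logic should not hard-code these."""

    bucket: str
    endpoint_url: Optional[str]  # None means "use the provider's default endpoint"
    region: str
    use_ssl: bool
    prefix_root: str

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "LakeConfig":
        # The variable names below are hypothetical, chosen for this example.
        return cls(
            bucket=env["LAKE_BUCKET"],  # required: fail fast if missing
            endpoint_url=env.get("LAKE_ENDPOINT_URL") or None,
            region=env.get("LAKE_REGION", "us-east-1"),
            use_ssl=env.get("LAKE_USE_SSL", "true").lower() == "true",
            prefix_root=env.get("LAKE_PREFIX_ROOT", "").strip("/"),
        )


if __name__ == "__main__":
    local = LakeConfig.from_env(
        {
            "LAKE_BUCKET": "turanmart-lake-dev",
            "LAKE_ENDPOINT_URL": "http://localhost:9000",
            "LAKE_USE_SSL": "false",
        }
    )
    print(local)
```

Passing a mapping instead of reading `os.environ` directly inside the pipeline keeps the same code testable for a local MinIO target and a managed cloud bucket.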

6.4 Designing Object Keys, Prefixes, and Lake Zones

A data lake is easier to govern when the object key tells a clear story. TuranMart will use a convention that includes zone, domain, dataset, source, partition columns, and a batch or run identifier where it applies. A raw order export might land at bronze/commerce/orders/source=postgres/ingest_date=2026-05-30/batch_id=001/orders.jsonl. A cleaned Parquet dataset might be written to silver/commerce/orders_clean/order_date=2026-05-30/part-0000.parquet. A BI-ready aggregate might appear under gold/commerce/daily_sales/order_date=2026-05-30/part-0000.parquet.

The point is not that every organization must copy this exact path. The point is that prefix design should encode ownership and query patterns without becoming a dumping ground for every detail. Put stable domain and dataset identifiers near the front. Put partition columns after the dataset name. Use lowercase, predictable separators, and avoid embedding secrets or personally identifiable information in object keys because object keys often appear in logs, catalogs, error messages, and access reports.

| Prefix component | Example | Decision rule |
| --- | --- | --- |
| Zone | bronze, silver, gold | Use zone names to communicate data quality and consumer readiness. |
| Domain | commerce, marketing, finance | Align with business ownership rather than source-system accidents. |
| Dataset | orders, clickstream, daily_sales | Keep names stable and meaningful to downstream users. |
| Source | source=postgres, source=web | Use when raw data from multiple origins lands in the same domain. |
| Partition | order_date=2026-05-30 | Partition by common filter columns, not by every available field. |
| Batch or run | batch_id=001 | Use for idempotency, lineage, and reprocessing. |
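
A convention is easier to enforce when keys are built by one function rather than by string concatenation scattered across scripts. The following is a minimal sketch, assuming a hypothetical helper named `build_key` and an illustrative character rule (lowercase letters, digits, underscore, dot, hyphen). It is one possible way to encode the components in the table, not a required implementation.

```python
import re
from typing import Mapping

# Illustrative rule: lowercase letters, digits, "_", ".", "-"; no "/" or "=".
_SEGMENT = re.compile(r"^[a-z0-9][a-z0-9_.\-]*$")
_ZONES = {"bronze", "silver", "gold"}


def _check(segment: str) -> str:
    """Reject segments that would break the naming convention."""
    if not _SEGMENT.match(segment):
        raise ValueError(f"invalid key segment: {segment!r}")
    return segment


def build_key(
    zone: str,
    domain: str,
    dataset: str,
    partitions: Mapping[str, str],
    filename: str,
) -> str:
    """Build zone/domain/dataset/name=value/.../filename in a fixed order."""
    if zone not in _ZONES:
        raise ValueError(f"unknown zone: {zone!r}")
    parts = [zone, _check(domain), _check(dataset)]
    # Mapping order is preserved, so callers control partition order.
    for name, value in partitions.items():
        parts.append(f"{_check(name)}={_check(value)}")
    parts.append(_check(filename))
    return "/".join(parts)


if __name__ == "__main__":
    print(
        build_key(
            "bronze",
            "commerce",
            "orders",
            {"source": "postgres", "ingest_date": "2026-05-30", "batch_id": "001"},
            "orders.jsonl",
        )
    )
```

Rejecting uppercase characters and stray separators at build time is cheaper than discovering an inconsistent prefix after several months of data has landed under it.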

Two common mistakes appear early. The first is to make prefixes mirror the structure of the source system too closely, such as exports/db1/table_42/tmp/final2. That path becomes meaningless once the source changes. The second is to over-partition, such as writing one partition per customer, product, hour, and campaign. Query engines then spend more time planning and listing files than reading useful data. Prefix design should serve the lake’s future readers, not only the first ingestion script.
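
Partition columns pay off only when readers can skip files by looking at the key alone. The sketch below illustrates that idea with two hypothetical helpers, `parse_partitions` and `prune`. Real query engines implement pruning internally; this example only shows why `name=value` segments in a key make it possible.

```python
from typing import Dict, Iterable, List


def parse_partitions(key: str) -> Dict[str, str]:
    """Extract name=value segments from the prefix part of an object key."""
    found: Dict[str, str] = {}
    for segment in key.split("/")[:-1]:  # the last segment is the file name
        if "=" in segment:
            name, value = segment.split("=", 1)
            found[name] = value
    return found


def prune(keys: Iterable[str], column: str, wanted: str) -> List[str]:
    """Keep only keys whose partition `column` equals `wanted`.

    No object is opened: the decision uses the key alone.
    """
    return [k for k in keys if parse_partitions(k).get(column) == wanted]


if __name__ == "__main__":
    keys = [
        "silver/commerce/orders_clean/order_date=2026-05-29/part-0000.parquet",
        "silver/commerce/orders_clean/order_date=2026-05-30/part-0000.parquet",
    ]
    print(prune(keys, "order_date", "2026-05-30"))
```

The same mechanism explains the cost of over-partitioning: every extra partition column multiplies the number of prefixes that must be listed and parsed before any data is read.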

6.5 From Data Lake to Medallion Architecture

A data lake needs quality layers because raw data and business-ready data have different responsibilities. The medallion architecture organizes data into Bronze, Silver, and Gold layers. Azure Databricks describes this pattern as a way to progressively improve data structure and quality as data moves from Bronze to Silver to Gold.[6]

Figure 4: The medallion architecture improves trust step by step: Bronze preserves evidence, Silver validates and conforms records, and Gold publishes business-ready outputs.

| Layer | Primary responsibility | Typical contents | TuranMart example |
| --- | --- | --- | --- |
| Bronze | Preserve source evidence with minimal transformation. | Raw JSON, CSV, logs, API responses, ingestion metadata. | Daily order exports and clickstream files exactly as received. |
| Silver | Clean, validate, type, deduplicate, and conform. | Parquet tables with enforced schema and data quality checks. | Valid orders with typed timestamps, positive amounts, and consistent product IDs. |
| Gold | Publish business-ready data products. | Aggregates, marts, metrics, feature tables, and dashboard-ready datasets. | Daily revenue by channel and product category for BI reporting. |

Bronze is not low-quality because engineers are careless. Bronze is intentionally close to the source because it preserves evidence for audit, reprocessing, and recovery. Silver is where the platform states its minimum truth contract: types are valid, required fields exist, duplicates are handled, sensitive columns are classified, and invalid records are quarantined rather than silently dropped. Gold is where business logic becomes explicit: revenue definitions, attribution rules, time windows, fiscal calendars, and dimensional models belong here.
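
The Silver contract described above can be sketched as a validation step that never discards a row without recording why. The example below is a simplified illustration, not the lab's `starter.py`: the field names (`order_id`, `product_id`, `order_ts`, `quantity`, `amount`) and the reason labels are assumptions chosen for this sketch.

```python
from datetime import datetime
from typing import Any, Dict, Iterable, List, Tuple

Record = Dict[str, Any]
REQUIRED = ("order_id", "product_id", "order_ts", "quantity", "amount")


def validate_order(row: Record) -> List[str]:
    """Return a list of rejection reasons; an empty list means the row is valid."""
    reasons = [f"missing_{f}" for f in REQUIRED if row.get(f) in (None, "")]
    ts = row.get("order_ts")
    if ts not in (None, ""):
        try:
            datetime.fromisoformat(str(ts))
        except ValueError:
            reasons.append("invalid_order_ts")
    for field in ("quantity", "amount"):
        value = row.get(field)
        if value in (None, ""):
            continue  # already reported as missing
        try:
            if float(value) <= 0:
                reasons.append(f"non_positive_{field}")
        except (TypeError, ValueError):
            reasons.append(f"non_numeric_{field}")
    return reasons


def split_clean_rejected(rows: Iterable[Record]) -> Tuple[List[Record], List[Record]]:
    """Route each row to Silver or to quarantine, keeping the reasons."""
    clean: List[Record] = []
    rejected: List[Record] = []
    seen = set()
    for row in rows:
        reasons = validate_order(row)
        if not reasons and row["order_id"] in seen:
            reasons.append("duplicate_order_id")
        if reasons:
            rejected.append({**row, "rejection_reasons": reasons})
        else:
            seen.add(row["order_id"])
            clean.append(row)
    return clean, rejected
```

Keeping the original row alongside its reasons means the quarantine output can be audited, corrected at the source, and replayed, which is the behavior "quarantined rather than silently dropped" implies.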

Apache Parquet is a common file format for Silver and Gold because it is an open columnar format with an Apache-maintained specification and developer documentation.[7] Columnar layout helps analytical engines read only relevant columns and compress repeated values efficiently. However, Parquet alone does not provide table transactions, concurrent writes, or schema evolution guarantees. Those topics lead naturally into Chapter 7, where lakehouse table formats bring transactional behavior to data lake storage.

6.6 Governance: Preventing the Data Swamp

A lake becomes a swamp when people can put data in but cannot reliably find, understand, secure, trust, or retire it. Governance should not be added after chaos appears. It should begin with the first bucket, the first prefix convention, and the first lab dataset.

Figure 5: Governance keeps the lake usable by connecting metadata, quality, access, encryption, lifecycle, and observability controls to every zone.

| Governance pillar | What must be decided early | Example implementation |
| --- | --- | --- |
| Metadata and catalog | Who owns each dataset, what schema it has, where it came from, and who uses it. | Data catalog entries with owners, descriptions, schemas, freshness, and lineage. |
| Data quality | Which checks block promotion from Bronze to Silver and from Silver to Gold. | Not-null, uniqueness, accepted values, schema checks, and quarantine tables. |
| Access control | Which teams can read raw, cleaned, sensitive, and aggregated data. | Bucket policies, IAM/RAM roles, access points, separate buckets, and least privilege. |
| Encryption and privacy | Which data must be encrypted, masked, tokenized, or isolated. | Server-side encryption, key management, PII tagging, and masked Gold views. |
| Lifecycle and retention | How long raw, cleaned, archive, and derived data should remain. | Lifecycle transition and expiration rules. AWS S3 lifecycle rules can transition objects to lower-cost classes or delete expired objects automatically.[8] |
| Observability | How platform teams detect ingestion failures, stale datasets, excessive cost, and unusual access. | Object inventory, metrics, alerts, query-cost dashboards, and audit logs. |
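
Lifecycle decisions become reviewable when they are written as data rather than clicked into a console. The sketch below builds a rule set in the dictionary shape that the S3 lifecycle API accepts (with boto3, such a dictionary is passed as `LifecycleConfiguration` to `put_bucket_lifecycle_configuration`). The helper name `lifecycle_rule`, the rule IDs, and every retention period are hypothetical placeholders, not recommendations. Storage class names are provider-specific, so `GLACIER` here is an AWS-style assumption that may not apply to MinIO or OSS.

```python
from typing import Any, Dict, Optional


def lifecycle_rule(
    rule_id: str,
    prefix: str,
    expire_days: int,
    transition_days: Optional[int] = None,
    storage_class: str = "GLACIER",  # AWS-style class name; adjust per provider
) -> Dict[str, Any]:
    """Build one lifecycle rule scoped to a key prefix."""
    rule: Dict[str, Any] = {
        "ID": rule_id,
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Expiration": {"Days": expire_days},
    }
    if transition_days is not None:
        if transition_days >= expire_days:
            raise ValueError("transition must happen before expiration")
        rule["Transitions"] = [
            {"Days": transition_days, "StorageClass": storage_class}
        ]
    return rule


# Hypothetical retention periods, for illustration only.
lifecycle_configuration = {
    "Rules": [
        lifecycle_rule("bronze-archive-then-expire", "bronze/", expire_days=730, transition_days=90),
        lifecycle_rule("silver-expire", "silver/", expire_days=365),
    ]
}

if __name__ == "__main__":
    import json

    print(json.dumps(lifecycle_configuration, indent=2))
```

Because the rules are scoped by prefix, the zone names chosen in the key convention double as the boundaries for retention policy.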

Governance also shapes how teams use Alibaba Cloud OSS, Amazon S3, or MinIO differently. Alibaba’s OSS data lake documentation emphasizes lifecycle rules, storage classes, access points, bucket inventory, replication, QoS controls, and multiple access interfaces such as SDKs, Hadoop connectors, OSS-HDFS, ossfs, ossutil, and ossbrowser.[3] The architectural lesson is broader than one cloud: object storage gives the durable substrate, but platform controls make that substrate safe for many teams.

6.7 Design Pattern: TuranMart’s Bronze-Silver-Gold Lake

TuranMart uses one bucket per environment, starting with an S3-compatible development bucket named turanmart-lake-dev. Production may later split domains into multiple buckets for compliance, but the first implementation keeps a single bucket with strict prefixes so the team can validate data flow before expanding platform boundaries.

Figure 6: The guided lab uses a predictable object layout so readers can inspect raw evidence, cleaned Parquet outputs, rejected records, and published Gold aggregates.

| Layer | Prefix pattern | File format | Main consumer | Promotion rule |
| --- | --- | --- | --- | --- |
| Bronze | bronze/{domain}/{dataset}/ingest_date=YYYY-MM-DD/ | Source-native JSONL or CSV | Data engineers and auditors | Land exactly once with ingestion metadata. |
| Silver | silver/{domain}/{dataset}_clean/{business_date=YYYY-MM-DD}/ | Parquet plus rejection records | Data analysts, data scientists, quality jobs | Pass schema, type, duplicate, and validity checks. |
| Gold | gold/{domain}/{metric_or_mart}/{business_date=YYYY-MM-DD}/ | Parquet or compact CSV extracts | BI, finance, operations, ML features | Use documented business definitions and freshness SLA. |

The first version intentionally avoids advanced table formats so the storage lesson remains clear. The lake writes immutable files, validates deterministic outputs, and keeps rejected records visible. Chapter 7 will add warehouse and lakehouse table semantics such as dimensional modeling, slowly changing dimensions, and transactional table workflows.

6.8 Guided Lab: Build a Local Bronze-Silver-Gold Lake with MinIO Layouts and Parquet

6.8.1 Lab Goal

In this lab, you will create a small TuranMart data lake using a local folder that mirrors object-storage keys. The Docker Compose file also starts MinIO so you can inspect the same layout through an S3-compatible service. The starter script lands raw order and product files in Bronze, validates and cleans them into Silver Parquet datasets, writes rejected records, and publishes a Gold daily revenue table.

6.8.2 Lab Materials

| Lab material | Purpose | Link |
| --- | --- | --- |
| Lab README | Step-by-step instructions, cleanup, and troubleshooting. | README |
| Docker Compose | Optional MinIO service for S3-compatible inspection. | docker-compose.yml |
| Requirements | Python packages for Parquet and optional S3 access. | requirements.txt |
| Starter pipeline | Builds Bronze, Silver, and Gold outputs. | starter.py |
| Validator | Checks deterministic row counts, aggregates, and file layout. | tests/validate_lab.py |
| Raw data | Small order and product inputs. | data |
| Expected output | Reference aggregate and manifest. | expected_output |
| Solution guide | Explanation of design choices and expected results. | solution |

6.8.3 Quick Start

From the repository root, run the lab locally. MinIO is optional for the deterministic validation path, but the Compose file is included so readers can inspect buckets and objects through a local S3-compatible console.

cd shared/labs/ch06_object_storage_data_lake
python3 -m pip install -r requirements.txt
python3 starter.py --lake-root .lake/turanmart-lake-dev
python3 tests/validate_lab.py --lake-root .lake/turanmart-lake-dev

If you want to inspect the same project with MinIO, start the service and open the console at http://localhost:9001. Use the credentials in the lab README, create a bucket named turanmart-lake-dev, and upload the generated bronze/, silver/, and gold/ folders if you want to practice object operations.

docker compose up -d
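
Instead of uploading folders by hand, the generated files can be pushed with an S3 client. The sketch below is one possible approach, not part of the lab's starter code. It assumes the `boto3` package is installed, that the MinIO S3 API is reachable at an endpoint such as http://localhost:9000 (an assumption; check the lab's Compose file), and that the bucket already exists. The function names `iter_lake_objects` and `upload_lake` are hypothetical.

```python
from pathlib import Path
from typing import Iterator, Tuple

ZONES = ("bronze", "silver", "gold")


def iter_lake_objects(lake_root: Path) -> Iterator[Tuple[Path, str]]:
    """Yield (local_path, object_key) for every file under the zone folders.

    The key is the path relative to the lake root with "/" separators, so
    the local layout and the bucket layout stay identical.
    """
    for zone in ZONES:
        zone_dir = lake_root / zone
        if not zone_dir.is_dir():
            continue
        for path in sorted(zone_dir.rglob("*")):
            if path.is_file():
                yield path, path.relative_to(lake_root).as_posix()


def upload_lake(
    lake_root: Path,
    bucket: str,
    endpoint_url: str,
    access_key: str,
    secret_key: str,
) -> int:
    """Upload all zone files to an S3-compatible bucket; return the count."""
    import boto3  # imported lazily so the key mapping works without boto3

    s3 = boto3.client(
        "s3",
        endpoint_url=endpoint_url,
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
    )
    uploaded = 0
    for path, key in iter_lake_objects(lake_root):
        s3.upload_file(str(path), bucket, key)
        uploaded += 1
    return uploaded
```

Separating the key mapping from the transport means the same `iter_lake_objects` output can be checked locally before any network call is made. Files outside the three zone folders are ignored on purpose.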

6.8.4 Expected Output

A successful run prints a compact validation summary and writes deterministic Gold output. The exact object-store transport can differ by environment, but the rows and metrics should match.

BRONZE orders rows: 8
SILVER clean orders rows: 6
SILVER rejected orders rows: 2
GOLD daily revenue rows: 3
GOLD total_revenue: 575.50
VALIDATION PASSED

| order_date | order_count | item_count | total_revenue |
| --- | --- | --- | --- |
| 2026-05-28 | 2 | 3 | 203.50 |
| 2026-05-29 | 2 | 2 | 192.00 |
| 2026-05-30 | 2 | 5 | 180.00 |
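
A Gold table with these four columns can be produced by a plain group-by over the clean Silver rows. The sketch below is an illustration under assumptions, not the lab's actual `starter.py`: it assumes each clean order carries `order_date`, `quantity`, and an order-level `amount`, and it uses `Decimal` so that currency sums stay exact. The function name `daily_revenue` is hypothetical.

```python
from collections import defaultdict
from decimal import Decimal
from typing import Any, Dict, Iterable, List


def daily_revenue(clean_orders: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Aggregate clean orders into one row per order_date, sorted by date."""
    totals: Dict[str, Dict[str, Any]] = defaultdict(
        lambda: {"order_count": 0, "item_count": 0, "total_revenue": Decimal("0")}
    )
    for row in clean_orders:
        day = totals[row["order_date"]]
        day["order_count"] += 1
        day["item_count"] += int(row["quantity"])
        # Convert through str so float inputs do not carry binary noise.
        day["total_revenue"] += Decimal(str(row["amount"]))
    return [{"order_date": d, **totals[d]} for d in sorted(totals)]


if __name__ == "__main__":
    # Made-up rows for illustration; these are not the lab's input data.
    sample = [
        {"order_date": "2026-05-29", "quantity": 1, "amount": "92.00"},
        {"order_date": "2026-05-28", "quantity": 2, "amount": "100.50"},
        {"order_date": "2026-05-28", "quantity": 1, "amount": "103.00"},
    ]
    for line in daily_revenue(sample):
        print(line)
```

Sorting by date and avoiding floating-point sums are both choices that make the output deterministic, which is what allows a validator to compare it against a reference file.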

6.8.5 Completion Checklist

You have completed the lab when the Bronze folder contains source-native files, the Silver folder contains Parquet data plus a visible rejection file, the Gold folder contains a daily revenue Parquet dataset, and the validator prints VALIDATION PASSED. You should also be able to explain why invalid rows were rejected and why the Gold table is smaller than the Silver table.

6.8.6 Troubleshooting

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| ModuleNotFoundError: pyarrow | Requirements were not installed in the active Python environment. | Run python3 -m pip install -r requirements.txt from the lab folder. |
| MinIO console does not open | Docker is not running or ports 9000/9001 are already in use. | Start Docker, stop the conflicting service, or edit the Compose ports. |
| Validator reports wrong revenue | Starter output is stale or the raw data was edited. | Delete .lake/ and rerun starter.py before validating. |
| Too many output files | The script was modified to write one file per row or partition. | Compact small datasets and use stable partition columns. |
| Objects are visible but hard to query | Prefixes do not follow the documented zone/domain/dataset pattern. | Recreate the lake using the starter defaults and compare paths. |

6.8.7 Cleanup

docker compose down -v
rm -rf .lake

Common Pitfalls

A data lake usually fails gradually. The first week feels productive because files are easy to land. The failure appears months later when nobody can identify the owner of a dataset, analysts query the wrong prefix, raw data contains untracked PII, and compute bills grow because every query scans thousands of tiny files.

| Pitfall | Why it hurts | Better practice |
| --- | --- | --- |
| Treating object storage like a shared disk | Applications assume rename, overwrite, locking, or POSIX behavior that object stores do not guarantee in the same way. | Design pipelines around immutable files, idempotent writes, and explicit manifests. |
| Creating too many tiny files | Query engines spend excessive time listing, planning, and opening files. | Compact outputs and target file sizes appropriate for the processing engine. |
| Partitioning by high-cardinality columns | The lake gains millions of folders but little query pruning benefit. | Partition by common filters such as date, region, or domain. |
| Skipping rejected-record handling | Bad data disappears or silently corrupts aggregates. | Quarantine invalid records with rejection reasons and ingestion metadata. |
| Publishing Gold without definitions | Teams argue about metrics because logic is hidden in scripts. | Document business definitions, owners, freshness, and source lineage. |
| Ignoring lifecycle and retention | Old data remains forever in expensive classes or is deleted without compliance review. | Define retention, archival, and expiration policies per zone and domain. |
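
The "explicit manifests" practice in the first row can be illustrated with a small checksum listing. The sketch below is an assumption-laden example, not the lab's manifest format: the function names `build_manifest` and `manifests_match` and the field names `key`, `size_bytes`, and `sha256` are hypothetical. It reads a local folder that mirrors the bucket layout.

```python
import hashlib
import json
from pathlib import Path
from typing import Dict, List


def build_manifest(lake_root: Path, zone: str) -> List[Dict[str, object]]:
    """List every file in a zone with its size and SHA-256 digest."""
    entries: List[Dict[str, object]] = []
    for path in sorted((lake_root / zone).rglob("*")):
        if path.is_file():
            data = path.read_bytes()  # fine for small lab files
            entries.append(
                {
                    "key": path.relative_to(lake_root).as_posix(),
                    "size_bytes": len(data),
                    "sha256": hashlib.sha256(data).hexdigest(),
                }
            )
    return entries


def manifests_match(expected: List[Dict[str, object]], actual: List[Dict[str, object]]) -> bool:
    """Compare two manifests by their canonical JSON form."""
    return json.dumps(expected, sort_keys=True) == json.dumps(actual, sort_keys=True)
```

A rerun that produces the same manifest is evidence that the write was idempotent. A changed digest under an unchanged key signals that an object was overwritten, which is exactly the behavior immutable-file pipelines try to avoid.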

Mini-Capstone: Lake Design Record for TuranMart

Write a one-page lake design record for TuranMart’s next production iteration. Your record should identify the bucket strategy, prefix convention, Bronze/Silver/Gold responsibilities, file formats, partitioning strategy, quality gates, access roles, lifecycle policy, and first three datasets to onboard. Keep the design specific enough that another engineer could implement the first version without asking what raw, clean, or published mean.

| Decision area | Minimum answer required |
| --- | --- |
| Bucket strategy | One bucket per environment, one bucket per domain, or another justified pattern. |
| Prefix convention | Zone, domain, dataset, source, partition, and batch/run naming. |
| Quality gates | Checks required before data moves from Bronze to Silver and Silver to Gold. |
| Governance controls | Catalog, owner, access, encryption, retention, and audit approach. |
| Operational plan | Monitoring, backfill, deletion, and cost-control strategy. |

Exercises

| Difficulty | Exercise | Expected result |
| --- | --- | --- |
| Easy | Add a new valid order row to the raw input and rerun the lab. | The Silver row count and Gold aggregate for that date change predictably. |
| Medium | Add a malformed order with a missing product_id and negative quantity. | The row appears in the Silver rejection file with an explainable reason. |
| Medium | Change the Gold aggregation from daily revenue to daily revenue by channel. | The Gold table gains a channel dimension while total revenue remains consistent. |
| Challenge | Modify the script to upload generated objects to MinIO using boto3 when endpoint variables are present. | Local filesystem mode still validates, and S3 mode writes the same keys to the MinIO bucket. |
| Team | Design a production lifecycle policy for Bronze, Silver, and Gold zones. | The team can explain retention, archival, legal hold, and cost implications. |

Review Questions

| Question | What a strong answer should include |
| --- | --- |
| Why does object storage fit data lake workloads better than block storage? | API-based scale, durability, cost profile, decoupled compute, and analytical file patterns. |
| Why are object prefixes important if the namespace is technically flat? | Prefixes drive human organization, partition pruning, lifecycle rules, access design, and catalog usability. |
| What is the difference between Bronze and Silver? | Bronze preserves raw evidence; Silver enforces schema, quality, deduplication, and conformed types. |
| Why is Parquet common in Silver and Gold zones? | Open columnar format, efficient analytical reads, compression, and broad engine support. |
| What governance controls should exist before onboarding sensitive datasets? | Ownership, catalog metadata, least-privilege access, encryption, privacy classification, audit logging, and retention. |
| What problem does Chapter 7 solve after this chapter? | It adds warehouse and lakehouse table modeling, including business schemas and transactional table workflows on top of lake storage. |

Summary

Object storage is the durable API substrate behind many modern data lakes. It stores data as objects in buckets, uses keys and prefixes rather than true directories, and scales independently from the compute engines that process the data. S3-compatible APIs make the model portable enough to learn locally with MinIO and deploy with managed services such as Amazon S3 or Alibaba Cloud OSS.

A data lake becomes useful only when its layout, file formats, quality layers, and governance controls are intentional. Bronze preserves source evidence, Silver validates and conforms records, and Gold publishes business-ready data products. Prefix conventions, Parquet files, lifecycle rules, access policies, catalogs, and quality checks are not administrative extras; they are the engineering controls that prevent the lake from becoming a swamp.

In the next chapter, you will build on this storage foundation by moving into data warehouses and lakehouse tables. The central question changes from “Where do we store analytical files?” to “How do we model, query, update, and govern analytical tables for business users?”

References

Footnotes