Appendix A: Setting Up Your Development Environment

Introduction

A consistent and powerful development environment is crucial for any data engineer. This appendix guides you through setting up a complete, containerized environment using Docker and Visual Studio Code (VS Code), which mirrors the setup used for the exercises in this book. This ensures that all examples and solutions work consistently, regardless of your host operating system (Windows, macOS, or Linux).

Core Technologies

Our development environment is built on these core technologies:

| Technology | Purpose | Why We Use It |
| --- | --- | --- |
| Docker & Docker Compose | Containerization | Provides isolated, reproducible environments for all services (databases, Spark, etc.) |
| Visual Studio Code | Code Editor | Excellent support for Docker, Python, SQL, and remote development |
| Windows Subsystem for Linux (WSL) 2 | Linux on Windows | Provides a native Linux environment on Windows for better performance and compatibility |
| Git & GitHub | Version Control | Essential for managing code, collaborating, and tracking changes |

Step 1: Install Prerequisites

Before you begin, you must install the following software on your host machine.

For All Operating Systems

  1. Visual Studio Code: Download and install the latest version from the [official website][1].

  2. Docker Desktop: Download and install Docker Desktop for your OS from [Docker Hub][2]. This single installation includes both Docker Engine and Docker Compose.

    • On Windows, Docker Desktop will automatically use WSL 2 as its backend, which is the recommended setup.

    • On macOS, download the installer that matches your hardware: Docker Desktop ships separate builds for Apple silicon (M1/M2/M3) and Intel-based Macs.

  3. Git: Install Git from the [official website][3]. This is essential for version control and for cloning the book’s repository.
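
Once the installers have finished, it can help to confirm that the command-line tools are reachable from a terminal. The following is a minimal sketch and not part of the book's repository: the helper name `find_missing_tools` is hypothetical, and it assumes the VS Code installer added the `code` launcher to your PATH, which is an optional step on some platforms.

```python
import shutil


def find_missing_tools(tools):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    # Command names assumed to be provided by the installations above.
    missing = find_missing_tools(["code", "docker", "git"])
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All prerequisite tools were found on PATH.")
```

If a tool is reported as missing right after installation, opening a new terminal is often enough, because PATH changes only apply to new sessions.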

Windows-Specific: Install WSL 2

If you are on Windows 10 (version 2004 or higher) or Windows 11, installing WSL 2 is highly recommended for the best Docker performance.

  1. Enable WSL: Open PowerShell as an administrator and run:

    wsl --install

    This command enables the necessary Windows features, downloads the latest Linux kernel, and installs Ubuntu as the default distribution.

  2. Set WSL 2 as Default: Ensure WSL 2 is your default version:

    wsl --set-default-version 2

  3. Integrate with Docker Desktop: In Docker Desktop settings, go to Resources > WSL Integration and ensure that integration with your default WSL distro is enabled.

VS Code Extensions

To turn VS Code into a powerful data engineering IDE, install the extensions used in this appendix, including the Docker extension and SQLTools. You can find them in the Extensions view (Ctrl+Shift+X, or Cmd+Shift+X on macOS).

Step 2: Clone the GitHub Repository

All the code, exercises, and Docker configurations for this book are in a single GitHub repository.

  1. Open your terminal (or Git Bash on Windows).

  2. Navigate to the directory where you want to store the project.

  3. Clone the repository:

    git clone https://github.com/your-username/data-engineering-in-action.git
    cd data-engineering-in-action

Step 3: Launch the Docker Environment

The docker-compose.yml file in the root of the repository defines all the services you’ll need for the exercises, including databases, Spark, Kafka, and more.

  1. Open the Project in VS Code:

    code .

  2. Start All Services: Open the integrated terminal in VS Code (Ctrl+`) and run:

    docker-compose up -d

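    The -d flag runs the containers in detached mode (in the background). On the first run, this command downloads all the necessary Docker images before starting the services, which may take some time. With Compose V2, which current Docker Desktop releases include, the same command can also be written as docker compose up -d.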

  3. Verify Running Containers: You can see the status of your containers in a few ways:

    • Via Terminal: Run docker-compose ps or docker ps.

    • Via VS Code Docker Extension: Open the Docker view in the sidebar to see all running containers.

    You should see services like de-postgres, de-mysql, de-kafka, de-spark-master, etc., with a running or healthy status.
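
The same status check can be scripted. The sketch below is illustrative only: the function names are hypothetical, the container names are the ones listed above, and it treats any status that starts with "Up" as running. It calls `docker ps --format` with a Go template and uses `|` purely as a separator chosen for this example.

```python
import subprocess


def parse_container_status(ps_output):
    """Map container name to status text from 'name|status' lines."""
    statuses = {}
    for line in ps_output.splitlines():
        name, sep, status = line.partition("|")
        if sep:  # skip blank or malformed lines
            statuses[name.strip()] = status.strip()
    return statuses


def not_running(expected, statuses):
    """Return expected containers that are absent or whose status is not 'Up ...'."""
    return [name for name in expected
            if not statuses.get(name, "").startswith("Up")]


if __name__ == "__main__":
    # By default `docker ps` lists running containers only.
    result = subprocess.run(
        ["docker", "ps", "--format", "{{.Names}}|{{.Status}}"],
        capture_output=True, text=True, check=True,
    )
    expected = ["de-postgres", "de-mysql", "de-kafka", "de-spark-master"]
    problems = not_running(expected, parse_container_status(result.stdout))
    print("Not running:", ", ".join(problems) if problems else "none")
```

Keeping the parsing separate from the `subprocess` call makes the logic easy to try on pasted output without Docker running.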

Step 4: Verify Your Setup

The repository includes a verification script to ensure everything is configured correctly.

  1. Set up Python Virtual Environment (Recommended):

    python3 -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate

  2. Install Python Dependencies:

    pip install -r requirements.txt

  3. Run the Verification Script:

    python scripts/verify_setup.py

The script will check your Python version, Docker installation, running services, and database connections. If all checks pass, you are ready to go! If any checks fail, the script will provide guidance on how to fix the issue.
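
To give a sense of what such checks can look like, here is a minimal sketch of two of them. It is not the contents of scripts/verify_setup.py: the function names are hypothetical, and the minimum Python version (3.8) is an assumption made only for illustration.

```python
import socket
import sys


def python_is_at_least(major, minor):
    """Return True if the running interpreter is at least major.minor."""
    return sys.version_info >= (major, minor)


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # 3.8 is an assumed minimum; the book's script may require another version.
    print("Python version OK:", python_is_at_least(3, 8))
    # 5432 is the PostgreSQL port used in this appendix.
    print("PostgreSQL port reachable:", port_is_open("localhost", 5432))
```

An open port only shows that something is listening; it does not confirm that the database accepts your credentials, which is why a full verification also attempts a real connection.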

Step 5: Connect to Services

Here’s how to connect to the key services running in your Docker environment.

Connecting to Databases with SQLTools

  1. Open the SQLTools view in the VS Code sidebar.

  2. Click “Add New Connection” and select the database type (e.g., PostgreSQL).

  3. Use the credentials from the docker-compose.yml file. For example, for PostgreSQL:

    • Server Address: localhost

    • Port: 5432

    • Database: dataeng_db

    • Username: dataeng

    • Password: dataeng123
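
The same settings can be used from Python. As a sketch, the hypothetical helper below assembles them into a PostgreSQL connection URI and percent-encodes the user name, password, and database name so that special characters do not break the URI. The values are the example credentials listed above; check docker-compose.yml for the ones your copy actually uses.

```python
from urllib.parse import quote


def build_postgres_url(user, password, host, port, database):
    """Build a postgresql:// URI, percent-encoding user, password, and database."""
    return (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{quote(database, safe='')}"
    )


if __name__ == "__main__":
    url = build_postgres_url(
        "dataeng", "dataeng123", "localhost", 5432, "dataeng_db"
    )
    print(url)
```

A URI in this form is accepted by common clients such as psycopg2 (`psycopg2.connect(url)`) and SQLAlchemy (`create_engine(url)`), assuming one of them is installed in your virtual environment.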

Accessing Web UIs

Several services have web interfaces that you can access in your browser:

| Service | URL | Description |
| --- | --- | --- |
| Spark Master | http://localhost:8080 | View Spark cluster status, workers, and applications. |
| MinIO Console | http://localhost:9001 | S3-compatible object storage browser. (User: minioadmin, Pass: minioadmin123) |
| Jupyter Lab | http://localhost:8888 | Interactive development with notebooks. |
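
If a page does not load, a quick scripted probe can tell you whether the service is answering at all. This is a minimal sketch using only the standard library; the helper name `http_status` is hypothetical, and the URLs are the ones from the table above.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=3.0):
    """Return the HTTP status code for url, or None if it cannot be reached."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return None  # connection refused, timeout, DNS failure, etc.


if __name__ == "__main__":
    urls = {
        "Spark Master": "http://localhost:8080",
        "MinIO Console": "http://localhost:9001",
        "Jupyter Lab": "http://localhost:8888",
    }
    for name, url in urls.items():
        status = http_status(url)
        print(f"{name}: {'unreachable' if status is None else status}")
```

A result of "unreachable" usually means the container is not running or its port is not published, so checking the container status is a sensible next step.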

Environment Management

Common Docker Compose Commands

Cleaning Up

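To stop and remove all containers, networks, and volumes created by Docker Compose, run the command below. Removing the volumes also deletes any data stored in them, such as database contents; omit --volumes if you want to keep that data.

    docker-compose down --volumes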

Troubleshooting

With this setup, you have a powerful, self-contained data engineering lab on your local machine, ready to tackle all the exercises in this book.


References

[1]: Visual Studio Code - Official Website
[2]: Docker Hub - Docker Desktop
[3]: Git - Official Website