Introduction¶
A consistent and powerful development environment is crucial for any data engineer. This appendix guides you through setting up a complete, containerized environment using Docker and Visual Studio Code (VS Code), which mirrors the setup used for the exercises in this book. This ensures that all examples and solutions work consistently, regardless of your host operating system (Windows, macOS, or Linux).
Core Technologies¶
Our development environment is built on these core technologies:
| Technology | Purpose | Why We Use It |
|---|---|---|
| Docker & Docker Compose | Containerization | Provides isolated, reproducible environments for all services (databases, Spark, etc.) |
| Visual Studio Code | Code Editor | Excellent support for Docker, Python, SQL, and remote development |
| Windows Subsystem for Linux (WSL) 2 | Linux on Windows | Provides a native Linux environment on Windows for better performance and compatibility |
| Git & GitHub | Version Control | Essential for managing code, collaborating, and tracking changes |
Step 1: Install Prerequisites¶
Before you begin, you must install the following software on your host machine.
For All Operating Systems¶
Visual Studio Code: Download and install the latest version from the [official website][1].
Docker Desktop: Download and install Docker Desktop for your OS from the [Docker Hub][2]. This single installation includes both Docker Engine and Docker Compose.
On Windows, Docker Desktop will automatically use WSL 2 as its backend, which is the recommended setup.
On macOS, download the installer that matches your processor: Apple silicon (M1/M2/M3) or Intel. Apple silicon or a recent Intel-based Mac gives the best performance.
Git: Install Git from the [official website][3]. This is essential for version control and for cloning the book’s repository.
Windows-Specific: Install WSL 2¶
If you are on Windows 10 (version 2004 or higher) or Windows 11, installing WSL 2 is highly recommended for the best Docker performance.
Enable WSL: Open PowerShell as an administrator and run:
wsl --install
This command enables the necessary Windows features, downloads the latest Linux kernel, and installs Ubuntu as the default distribution.
Set WSL 2 as Default: Ensure WSL 2 is your default version:
wsl --set-default-version 2
Integrate with Docker Desktop: In Docker Desktop settings, go to Resources > WSL Integration and ensure that integration with your default WSL distro is enabled.
VS Code Extensions¶
To turn VS Code into a capable data engineering IDE, install the following extensions. You can find them in the Extensions view (Ctrl+Shift+X, or Cmd+Shift+X on macOS):
Docker (by Microsoft): For managing containers, images, and Docker Compose directly from VS Code.
Python (by Microsoft): Provides rich support for Python, including IntelliSense, linting, debugging, and Jupyter Notebooks.
Dev Containers (by Microsoft): Allows you to use a Docker container as a full-featured development environment.
SQLTools (by Matheus Teixeira): A powerful SQL client for connecting to PostgreSQL, MySQL, and other databases.
Markdown All in One (by Yu Zhang): For writing and previewing Markdown files.
Step 2: Clone the GitHub Repository¶
All the code, exercises, and Docker configurations for this book are in a single GitHub repository.
Open your terminal (or Git Bash on Windows).
Navigate to the directory where you want to store the project.
Clone the repository:
git clone https://github.com/your-username/data-engineering-in-action.git
cd data-engineering-in-action
Step 3: Launch the Docker Environment¶
The docker-compose.yml file in the root of the repository defines all the services you’ll need for the exercises, including databases, Spark, Kafka, and more.
Open the Project in VS Code:
code .
Start All Services: Open the integrated terminal in VS Code (`` Ctrl+` ``) and run:
docker-compose up -d
The `-d` flag runs the containers in detached mode (in the background). This command downloads all the necessary Docker images and starts the services, which may take some time on the first run. If `docker-compose` is not found, use the Compose V2 form `docker compose` (with a space) for this and every other Compose command in this appendix.
Verify Running Containers: You can see the status of your containers in a few ways:
Via Terminal: Run `docker-compose ps` or `docker ps`.
Via VS Code Docker Extension: Open the Docker view in the sidebar to see all running containers.
You should see services like `de-postgres`, `de-mysql`, `de-kafka`, `de-spark-master`, etc., with a `running` or `healthy` status.
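The same status check can be scripted instead of read by eye. The following is a minimal sketch, not part of the book's repository: the function names are hypothetical, and it assumes the `docker` CLI is on your PATH. It relies on `docker ps --format` with the `.Names` and `.Status` template fields, whose status strings typically look like `Up 2 minutes (healthy)`.

```python
import subprocess


def parse_ps_output(text):
    """Turn tab-separated 'name<TAB>status' lines into a {name: status} dict."""
    statuses = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, _, status = line.partition("\t")
        statuses[name.strip()] = status.strip()
    return statuses


def running_containers():
    """Ask the Docker CLI for running containers and their status strings."""
    result = subprocess.run(
        ["docker", "ps", "--format", "{{.Names}}\t{{.Status}}"],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the docker command fails
    )
    return parse_ps_output(result.stdout)


def missing_services(statuses, expected):
    """Return the expected container names that are not currently listed."""
    return [name for name in expected if name not in statuses]
```

Calling `missing_services(running_containers(), ["de-postgres", "de-mysql", "de-kafka", "de-spark-master"])` returns an empty list when every named container is up. Parsing is kept separate from the subprocess call so it can be exercised without Docker installed.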
Step 4: Verify Your Setup¶
The repository includes a verification script to ensure everything is configured correctly.
Set up Python Virtual Environment (Recommended):
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
Install Python Dependencies:
pip install -r requirements.txt
Run the Verification Script:
python scripts/verify_setup.py
The script will check your Python version, Docker installation, running services, and database connections. If all checks pass, you are ready to go! If any checks fail, the script will provide guidance on how to fix the issue.
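To give a feel for the kind of checks described above, here is a small sketch using only the standard library. It is not the repository's `scripts/verify_setup.py`, whose contents are not shown here; the function names and the minimum Python version of 3.8 are assumptions made for illustration.

```python
import shutil
import sys


def python_version_ok(minimum=(3, 8)):
    """True when the running interpreter is at least `minimum` (an assumed floor)."""
    return sys.version_info[:2] >= minimum


def in_virtualenv():
    """True when running inside an environment created with `python -m venv`."""
    return sys.prefix != sys.base_prefix


def docker_cli_found():
    """True when a `docker` executable can be found on the PATH."""
    return shutil.which("docker") is not None


def run_checks():
    """Run every check and return a {label: passed} mapping."""
    return {
        "python version": python_version_ok(),
        "virtual environment": in_virtualenv(),
        "docker CLI": docker_cli_found(),
    }
```

Printing the result of `run_checks()` gives a quick pass/fail summary. A fuller script would also probe the running services and database connections, as the repository's script is described as doing.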
Step 5: Connect to Services¶
Here’s how to connect to the key services running in your Docker environment.
Connecting to Databases with SQLTools¶
Open the SQLTools view in the VS Code sidebar.
Click “Add New Connection” and select the database type (e.g., PostgreSQL).
Use the credentials from the `docker-compose.yml` file. For example, for PostgreSQL:
Server Address: `localhost`
Port: `5432`
Database: `dataeng_db`
Username: `dataeng`
Password: `dataeng123`
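The same credentials work from Python code. Many PostgreSQL clients, including psycopg and SQLAlchemy, accept a connection URI of the form `postgresql://user:password@host:port/database`. The sketch below only builds that URI; the helper name `postgres_url` is hypothetical, and which driver the exercises use is not assumed here.

```python
from urllib.parse import quote


def postgres_url(user, password, host="localhost", port=5432, database="postgres"):
    """Build a PostgreSQL connection URI, percent-encoding the credentials."""
    return (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{quote(database, safe='')}"
    )


# Values taken from the PostgreSQL example above.
url = postgres_url("dataeng", "dataeng123", database="dataeng_db")
print(url)  # postgresql://dataeng:dataeng123@localhost:5432/dataeng_db
```

Percent-encoding matters once a password contains characters such as `@` or `/`, which would otherwise be read as part of the URI structure.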
Accessing Web UIs¶
Several services have web interfaces that you can access in your browser:
| Service | URL | Description |
|---|---|---|
| Spark Master | http://localhost:8080 | View Spark cluster status, workers, and applications. |
| MinIO Console | http://localhost:9001 | S3-compatible object storage browser. (User: minioadmin, Pass: minioadmin123) |
| Jupyter Lab | http://localhost:8888 | Interactive development with notebooks. |
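If a page does not load, it helps to know whether the service is answering at all. The sketch below probes each URL from the table with the standard library. The helper names are hypothetical, and treating any HTTP response (even an error status) as "reachable" is a design choice for this illustration.

```python
import http.client
import urllib.error
import urllib.request

# URLs from the table above.
WEB_UIS = {
    "Spark Master": "http://localhost:8080",
    "MinIO Console": "http://localhost:9001",
    "Jupyter Lab": "http://localhost:8888",
}


def http_reachable(url, timeout=3.0):
    """Return True if a server answers the request with any HTTP status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server responded, just with an error status (e.g. 403 or 501).
        return True
    except (urllib.error.URLError, OSError, http.client.HTTPException):
        # Connection refused, timeout, or a non-HTTP reply.
        return False


def check_web_uis(urls=WEB_UIS):
    """Probe every URL and return a {service name: reachable} mapping."""
    return {name: http_reachable(url) for name, url in urls.items()}
```

Services can take a little while to start accepting connections after `docker-compose up -d`, so a `False` result immediately after startup may simply mean the container is still initializing.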
Environment Management¶
Common Docker Compose Commands¶
Start all services:
docker-compose up -d
Stop all services:
docker-compose down
View logs for a service:
docker-compose logs -f postgres
Restart a service:
docker-compose restart kafka
Execute a command in a container:
docker-compose exec postgres psql -U dataeng -d dataeng_db
Cleaning Up¶
To stop and remove all containers, networks, and volumes created by Docker Compose, run:
docker-compose down --volumes
Removing volumes also deletes the data stored in them, such as database contents, so use this only when you want a clean slate.
Troubleshooting¶
“Port is already allocated”: Another service on your machine is using a required port. Stop the conflicting service or change the port mapping in `docker-compose.yml`. For example, change `"5432:5432"` to `"5433:5432"` to map the container’s port 5432 to your host’s port 5433, then connect to `localhost:5433` instead.
Container fails to start: Check the logs with `docker-compose logs <service_name>` for error messages. This is often due to insufficient memory allocated to Docker Desktop.
Permission Denied: On Linux, you may need to run Docker commands with `sudo` if your user is not in the `docker` group.
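For the "Port is already allocated" case, you can find out in advance which host ports are taken. The sketch below is a hypothetical helper, not part of the repository. It tries to connect to each port on the loopback interface and reports the ones where something is already listening. The port list is taken from the services named in this appendix and may not match every mapping in `docker-compose.yml`.

```python
import socket

# Host ports mentioned in this appendix; adjust to match docker-compose.yml.
EXPECTED_PORTS = {
    "PostgreSQL": 5432,
    "Spark Master UI": 8080,
    "MinIO Console": 9001,
    "Jupyter Lab": 8888,
}


def port_is_free(port, host="127.0.0.1"):
    """Return True if nothing accepts a TCP connection on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.settimeout(1.0)
        # connect_ex returns 0 when a listener accepts the connection.
        return probe.connect_ex((host, port)) != 0


def busy_ports(ports=EXPECTED_PORTS):
    """Return the {service name: port} entries that are already in use."""
    return {name: port for name, port in ports.items() if not port_is_free(port)}
```

Run `busy_ports()` before starting the stack. Once the containers are up, these ports will be reported as busy by the stack itself.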
With this setup, you have a powerful, self-contained data engineering lab on your local machine, ready to tackle all the exercises in this book.
References¶
[1]: Visual Studio Code - Official Website
[2]: Docker Hub - Docker Desktop
[3]: Git - Official Website