Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Appendix B: A SQL and Python Primer for Data Engineers

Introduction

Structured Query Language (SQL) and Python are the two most critical languages for any data engineer. SQL is the universal language for interacting with relational databases, while Python provides the flexibility, powerful libraries, and glue needed to build complex data pipelines. This appendix serves as a practical primer or a refresher on the essential concepts of both languages that are most relevant to data engineering.

Part 1: Essential SQL for Data Engineering

SQL is used for managing and querying structured data. As a data engineer, you will use it for everything from data definition and manipulation to complex analytical queries.

Data Definition Language (DDL)

DDL statements are used to define and manage the database schema.

Data Manipulation Language (DML)

DML statements are used to add, modify, and delete data.

Data Query Language (DQL)

DQL is used to retrieve data. This is where data engineers spend a significant amount of time.

Advanced SQL: Window Functions

Window functions perform calculations across a set of table rows that are somehow related to the current row. They are incredibly powerful for analytics.

Common Table Expressions (CTEs)

CTEs (using the WITH clause) make complex queries more readable by breaking them down into logical, sequential steps.

WITH DepartmentSalaries AS (
    SELECT
        department_id,
        AVG(salary) as avg_dept_salary
    FROM employees
    GROUP BY department_id
)
SELECT
    e.first_name,
    e.salary,
    ds.avg_dept_salary
FROM employees e
JOIN DepartmentSalaries ds ON e.department_id = ds.department_id
WHERE e.salary > ds.avg_dept_salary;

Part 2: Essential Python for Data Engineering

Python is the de facto language for data engineering due to its simplicity, extensive libraries, and strong community support.

Core Data Structures

A solid understanding of Python’s built-in data structures is essential.

List Comprehensions and Generators

These are idiomatic and efficient ways to create lists and iterators.

# List comprehension to get squares of even numbers
even_squares = [x**2 for x in range(10) if x % 2 == 0]

# Generator expression for memory-efficient iteration
even_squares_gen = (x**2 for x in range(10) if x % 2 == 0)

Functions and Decorators

Functions are the building blocks of any application. Decorators are a powerful way to modify or enhance functions.

import time

def timing_decorator(func):
    """A decorator to measure the execution time of a function."""
    def wrapper(*args, **kwargs):
        start_time = time.time()
        result = func(*args, **kwargs)
        end_time = time.time()
        print(f"Function {func.__name__} took {end_time - start_time:.4f} seconds")
        return result
    return wrapper

@timing_decorator
def process_large_file(file_path):
    # Simulate processing
    time.sleep(2)
    print(f"Processed {file_path}")

process_large_file("/path/to/data.csv")

File Handling and Context Managers

Data engineers constantly read from and write to files. The with statement ensures that files are properly closed.

# Reading a file line by line
try:
    with open("data.csv", "r") as f:
        for line in f:
            print(line.strip())
except FileNotFoundError:
    print("File not found.")

# Writing to a file
with open("output.txt", "w") as f:
    f.write("Hello, Data Engineering!\n")

Key Data Engineering Libraries

Virtual Environments

Always use virtual environments to manage project dependencies and avoid conflicts. venv is built into Python.

  1. Create an environment:

    python3 -m venv my_project_env
  2. Activate it:

    • macOS/Linux: source my_project_env/bin/activate

    • Windows: my_project_env\Scripts\activate

  3. Install packages:

    pip install pandas requests
  4. Save dependencies:

    pip freeze > requirements.txt

This primer covers the foundational skills in SQL and Python that you will use daily as a data engineer. Mastering these concepts will provide a solid base for tackling the more advanced topics covered in this book.