Introduction¶
Structured Query Language (SQL) and Python are the two most critical languages for any data engineer. SQL is the universal language for interacting with relational databases, while Python provides the flexibility, powerful libraries, and glue needed to build complex data pipelines. This appendix serves as a practical primer or a refresher on the essential concepts of both languages that are most relevant to data engineering.
Part 1: Essential SQL for Data Engineering¶
SQL is used for managing and querying structured data. As a data engineer, you will use it for everything from data definition and manipulation to complex analytical queries.
Data Definition Language (DDL)¶
DDL statements are used to define and manage the database schema.
- `CREATE TABLE`: Defines a new table.

```sql
CREATE TABLE employees (
    employee_id SERIAL PRIMARY KEY,
    first_name VARCHAR(50) NOT NULL,
    last_name VARCHAR(50) NOT NULL,
    email VARCHAR(100) UNIQUE,
    hire_date DATE DEFAULT CURRENT_DATE,
    department_id INT,
    salary DECIMAL(10, 2) CHECK (salary > 0),
    CONSTRAINT fk_department FOREIGN KEY (department_id)
        REFERENCES departments(department_id)
);
```

- `ALTER TABLE`: Modifies an existing table.

```sql
-- Add a new column
ALTER TABLE employees ADD COLUMN is_active BOOLEAN DEFAULT TRUE;

-- Modify a column's data type
ALTER TABLE employees ALTER COLUMN email TYPE VARCHAR(150);

-- Add a constraint
ALTER TABLE employees ADD CONSTRAINT check_hire_date CHECK (hire_date <= CURRENT_DATE);
```

- `DROP TABLE`: Deletes a table.

```sql
DROP TABLE IF EXISTS employees CASCADE;
```
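The statements above use PostgreSQL syntax. To experiment without a running server, Python's built-in `sqlite3` module makes a handy throwaway sandbox; a minimal sketch, keeping in mind that SQLite's dialect differs (`INTEGER PRIMARY KEY` instead of `SERIAL`, and `ALTER COLUMN ... TYPE` is not supported):

```python
import sqlite3

# In-memory database: a disposable sandbox for trying out DDL statements.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("""
    CREATE TABLE employees (
        employee_id INTEGER PRIMARY KEY,
        first_name TEXT NOT NULL,
        last_name TEXT NOT NULL,
        email TEXT UNIQUE,
        salary REAL CHECK (salary > 0)
    )
""")

# ALTER TABLE: add a column (supported in SQLite)
cur.execute("ALTER TABLE employees ADD COLUMN is_active BOOLEAN DEFAULT 1")

# Inspect the resulting schema: PRAGMA table_info lists columns in order
columns = [row[1] for row in cur.execute("PRAGMA table_info(employees)")]
print(columns)

cur.execute("DROP TABLE IF EXISTS employees")
conn.close()
```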
Data Manipulation Language (DML)¶
DML statements are used to add, modify, and delete data.
- `INSERT`: Adds new rows to a table. Note that SQL string literals use single quotes; double quotes denote identifiers.

```sql
INSERT INTO employees (first_name, last_name, email, department_id, salary)
VALUES
    ('Jane', 'Doe', 'jane.doe@example.com', 1, 80000),
    ('John', 'Smith', 'john.smith@example.com', 2, 95000);
```

- `UPDATE`: Modifies existing rows.

```sql
UPDATE employees
SET salary = salary * 1.05
WHERE department_id = 1;
```

- `DELETE`: Removes rows from a table.

```sql
DELETE FROM employees
WHERE is_active = FALSE;
```
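These statements can be exercised end to end from Python using the built-in `sqlite3` module; a minimal sketch with illustrative rows, using parameterized queries rather than string interpolation:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "CREATE TABLE employees (name TEXT, department_id INT, salary REAL, is_active BOOLEAN)"
)

# INSERT with placeholders -- never interpolate values into SQL strings
rows = [("Jane Doe", 1, 80000, 1), ("John Smith", 2, 95000, 0)]
cur.executemany("INSERT INTO employees VALUES (?, ?, ?, ?)", rows)

# UPDATE: 5% raise for department 1
cur.execute("UPDATE employees SET salary = salary * 1.05 WHERE department_id = 1")

# DELETE: remove inactive employees
cur.execute("DELETE FROM employees WHERE is_active = 0")

remaining = cur.execute("SELECT name, salary FROM employees").fetchall()
print(remaining)  # only Jane remains, with her raise applied
conn.close()
```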
Data Query Language (DQL)¶
DQL is used to retrieve data. This is where data engineers spend a significant amount of time.
- `SELECT`: The cornerstone of data retrieval. (Note the single quotes around the space literal; double quotes are reserved for identifiers.)

```sql
SELECT
    employee_id,
    first_name || ' ' || last_name AS full_name,
    salary
FROM employees
WHERE department_id = 2
ORDER BY salary DESC
LIMIT 10;
```

- `JOIN`: Combines rows from two or more tables. Common join types include `INNER JOIN`, `LEFT JOIN`, `RIGHT JOIN`, and `FULL OUTER JOIN`.

```sql
SELECT e.first_name, d.department_name
FROM employees e
INNER JOIN departments d ON e.department_id = d.department_id;
```

- `GROUP BY` and aggregate functions: Used for summarizing data.

```sql
SELECT
    d.department_name,
    COUNT(e.employee_id) AS number_of_employees,
    AVG(e.salary) AS average_salary
FROM employees e
JOIN departments d ON e.department_id = d.department_id
GROUP BY d.department_name
HAVING COUNT(e.employee_id) > 5;
```
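One distinction worth internalizing: `WHERE` filters rows before grouping, while `HAVING` filters groups after aggregation. A self-contained sketch using Python's built-in `sqlite3` module (the sample data is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE employees (name TEXT, dept TEXT, salary REAL)")
cur.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("a", "Eng", 100), ("b", "Eng", 120), ("c", "Eng", 110), ("d", "Sales", 90)],
)

# GROUP BY with aggregates; HAVING keeps only departments with more than
# two employees, so the single-person Sales group is filtered out.
result = cur.execute("""
    SELECT dept, COUNT(*) AS n, AVG(salary) AS avg_salary
    FROM employees
    GROUP BY dept
    HAVING COUNT(*) > 2
""").fetchall()
print(result)
conn.close()
```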
Advanced SQL: Window Functions¶
Window functions perform calculations across a set of table rows that are somehow related to the current row. They are incredibly powerful for analytics.
- `ROW_NUMBER()`, `RANK()`, `DENSE_RANK()`: Assign a rank to rows.

```sql
-- Find the top 3 highest-paid employees in each department
WITH RankedSalaries AS (
    SELECT
        e.first_name,
        d.department_name,
        e.salary,
        RANK() OVER (PARTITION BY d.department_name ORDER BY e.salary DESC) AS salary_rank
    FROM employees e
    JOIN departments d ON e.department_id = d.department_id
)
SELECT *
FROM RankedSalaries
WHERE salary_rank <= 3;
```

- `LAG()` and `LEAD()`: Access data from a previous or subsequent row.

```sql
-- Calculate month-over-month sales growth
SELECT
    sale_month,
    monthly_revenue,
    LAG(monthly_revenue, 1) OVER (ORDER BY sale_month) AS previous_month_revenue,
    (monthly_revenue - LAG(monthly_revenue, 1) OVER (ORDER BY sale_month))
        / LAG(monthly_revenue, 1) OVER (ORDER BY sale_month) AS growth_rate
FROM monthly_sales;
```
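`LAG()` can be tried directly in SQLite (version 3.25 or later, which is bundled with most modern Python builds); a minimal sketch with made-up monthly figures, showing that the first row has no previous value and so yields `NULL`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE sales (month TEXT, revenue REAL)")
cur.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("2024-01", 100.0), ("2024-02", 110.0), ("2024-03", 99.0)],
)

# LAG pulls the previous row's value within the window ordering;
# the first month has no predecessor, so prev_revenue is NULL (None).
rows = cur.execute("""
    SELECT month,
           revenue,
           LAG(revenue) OVER (ORDER BY month) AS prev_revenue
    FROM sales
""").fetchall()
print(rows)
conn.close()
```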
Common Table Expressions (CTEs)¶
CTEs (using the WITH clause) make complex queries more readable by breaking them down into logical, sequential steps.
```sql
WITH DepartmentSalaries AS (
    SELECT
        department_id,
        AVG(salary) AS avg_dept_salary
    FROM employees
    GROUP BY department_id
)
SELECT
    e.first_name,
    e.salary,
    ds.avg_dept_salary
FROM employees e
JOIN DepartmentSalaries ds ON e.department_id = ds.department_id
WHERE e.salary > ds.avg_dept_salary;
```

Part 2: Essential Python for Data Engineering¶
Python is the de facto language for data engineering due to its simplicity, extensive libraries, and strong community support.
Core Data Structures¶
A solid understanding of Python’s built-in data structures is essential.
- **Lists**: Ordered, mutable collections, e.g. `[1, 2, 3]`
- **Tuples**: Ordered, immutable collections, e.g. `(1, 2, 3)`
- **Dictionaries**: Key-value mappings, insertion-ordered since Python 3.7, e.g. `{"key": "value"}`
- **Sets**: Unordered collections of unique elements, e.g. `{1, 2, 3}`
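The practical differences show up quickly in pipeline code; a short sketch of each structure's defining behavior:

```python
# Lists are mutable; tuples are not
nums = [1, 2, 3]
nums.append(4)            # in-place mutation is fine
point = (1, 2, 3)
try:
    point[0] = 9          # tuples reject item assignment
except TypeError as exc:
    error = str(exc)

# Dictionaries map keys to values and preserve insertion order (Python 3.7+)
config = {"host": "localhost", "port": 5432}
config["user"] = "dataeng"

# Sets keep only unique elements -- handy for de-duplicating IDs
ids = {1, 2, 2, 3, 3, 3}

print(nums, list(config), sorted(ids))
```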
List Comprehensions and Generators¶
These are idiomatic and efficient ways to create lists and iterators.
```python
# List comprehension to get squares of even numbers
even_squares = [x**2 for x in range(10) if x % 2 == 0]

# Generator expression for memory-efficient iteration
even_squares_gen = (x**2 for x in range(10) if x % 2 == 0)
```

Functions and Decorators¶
Functions are the building blocks of any application. Decorators are a powerful way to modify or enhance functions.
```python
import functools
import time

def timing_decorator(func):
    """A decorator to measure the execution time of a function."""
    @functools.wraps(func)  # preserve the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        start_time = time.time()
        result = func(*args, **kwargs)
        end_time = time.time()
        print(f"Function {func.__name__} took {end_time - start_time:.4f} seconds")
        return result
    return wrapper

@timing_decorator
def process_large_file(file_path):
    # Simulate processing
    time.sleep(2)
    print(f"Processed {file_path}")

process_large_file("/path/to/data.csv")
```

File Handling and Context Managers¶
Data engineers constantly read from and write to files. The `with` statement ensures that files are properly closed, even if an exception is raised.
```python
# Reading a file line by line
try:
    with open("data.csv", "r") as f:
        for line in f:
            print(line.strip())
except FileNotFoundError:
    print("File not found.")

# Writing to a file
with open("output.txt", "w") as f:
    f.write("Hello, Data Engineering!\n")
```

Key Data Engineering Libraries¶
- **Pandas**: The go-to library for in-memory data manipulation and analysis. It provides a powerful `DataFrame` object.

```python
import pandas as pd

# Read CSV into a DataFrame
df = pd.read_csv("sales_data.csv")

# Basic data cleaning (assignment is preferred over the deprecated inplace=True)
df["order_date"] = pd.to_datetime(df["order_date"])
df["revenue"] = df["revenue"].fillna(0)

# Group by and aggregate
monthly_sales = df.groupby(df["order_date"].dt.to_period("M"))["revenue"].sum()
print(monthly_sales)
```

- **Requests**: The de facto standard for making HTTP requests, essential for interacting with APIs.

```python
import requests

response = requests.get("https://api.github.com/users/google")
if response.status_code == 200:
    data = response.json()
    print(f"Google has {data['public_repos']} public repositories.")
```

- **Psycopg2 / PyMySQL**: Database drivers for connecting to PostgreSQL and MySQL.

```python
import psycopg2

conn = psycopg2.connect(
    host="localhost",
    database="dataeng_db",
    user="dataeng",
    password="dataeng123"
)
cursor = conn.cursor()
cursor.execute("SELECT version();")
print(cursor.fetchone())
cursor.close()
conn.close()
```

- **PySpark**: The Python API for Apache Spark, used for large-scale distributed data processing.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SimpleApp").getOrCreate()

log_data = spark.read.text("README.md").cache()
num_as = log_data.filter(log_data.value.contains("a")).count()
num_bs = log_data.filter(log_data.value.contains("b")).count()
print(f"Lines with a: {num_as}, lines with b: {num_bs}")

spark.stop()
```
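Both psycopg2 and PyMySQL implement the same DB-API 2.0 pattern (connect, cursor, execute, fetch, close), which Python's built-in `sqlite3` module follows as well; a dependency-free sketch of that pattern, with an illustrative table:

```python
import sqlite3

# Same DB-API 2.0 flow as psycopg2/PyMySQL: connect -> cursor -> execute -> fetch -> close
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row   # rows allow access by column name, like a dict
cur = conn.cursor()

cur.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, kind TEXT)")
cur.execute("INSERT INTO events (kind) VALUES (?)", ("page_view",))
conn.commit()

row = cur.execute("SELECT id, kind FROM events").fetchone()
print(dict(row))
conn.close()
```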
Virtual Environments¶
Always use virtual environments to manage project dependencies and avoid conflicts. `venv` is built into Python.

Create an environment:

```shell
python3 -m venv my_project_env
```

Activate it:

- macOS/Linux: `source my_project_env/bin/activate`
- Windows: `my_project_env\Scripts\activate`

Install packages:

```shell
pip install pandas requests
```

Save dependencies:

```shell
pip freeze > requirements.txt
```
This primer covers the foundational skills in SQL and Python that you will use daily as a data engineer. Mastering these concepts will provide a solid base for tackling the more advanced topics covered in this book.