Introduction¶
Structured Query Language (SQL) and Python are the two most critical languages for any data engineer. SQL is the universal language for interacting with relational databases, while Python provides the flexibility, powerful libraries, and glue needed to build complex data pipelines. This appendix serves as a practical primer or a refresher on the essential concepts of both languages that are most relevant to data engineering.
Part 1: Essential SQL for Data Engineering¶
SQL is used for managing and querying structured data. As a data engineer, you will use it for everything from data definition and manipulation to complex analytical queries.
Data Definition Language (DDL)¶
DDL statements are used to define and manage the database schema.
- `CREATE TABLE`: Defines a new table.

```sql
CREATE TABLE employees (
    employee_id SERIAL PRIMARY KEY,
    first_name VARCHAR(50) NOT NULL,
    last_name VARCHAR(50) NOT NULL,
    email VARCHAR(100) UNIQUE,
    hire_date DATE DEFAULT CURRENT_DATE,
    department_id INT,
    salary DECIMAL(10, 2) CHECK (salary > 0),
    CONSTRAINT fk_department FOREIGN KEY (department_id)
        REFERENCES departments(department_id)
);
```

- `ALTER TABLE`: Modifies an existing table.

```sql
-- Add a new column
ALTER TABLE employees ADD COLUMN is_active BOOLEAN DEFAULT TRUE;

-- Modify a column's data type
ALTER TABLE employees ALTER COLUMN email TYPE VARCHAR(150);

-- Add a constraint
ALTER TABLE employees ADD CONSTRAINT check_hire_date CHECK (hire_date <= CURRENT_DATE);
```

- `DROP TABLE`: Deletes a table.

```sql
DROP TABLE IF EXISTS employees CASCADE;
```
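The statements above use PostgreSQL syntax. To experiment without a running server, Python's built-in `sqlite3` module makes a handy throwaway sandbox; a minimal sketch, keeping in mind that SQLite's dialect differs (`INTEGER PRIMARY KEY` instead of `SERIAL`, and `ALTER COLUMN ... TYPE` is not supported):

```python
import sqlite3

# In-memory database: a disposable sandbox for trying out DDL statements.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("""
    CREATE TABLE employees (
        employee_id INTEGER PRIMARY KEY,
        first_name TEXT NOT NULL,
        last_name TEXT NOT NULL,
        email TEXT UNIQUE,
        salary REAL CHECK (salary > 0)
    )
""")

# ALTER TABLE: add a column (supported in SQLite)
cur.execute("ALTER TABLE employees ADD COLUMN is_active BOOLEAN DEFAULT 1")

# Inspect the resulting schema: PRAGMA table_info lists columns in order
columns = [row[1] for row in cur.execute("PRAGMA table_info(employees)")]
print(columns)

cur.execute("DROP TABLE IF EXISTS employees")
conn.close()
```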
Data Manipulation Language (DML)¶
DML statements are used to add, modify, and delete data.
- `INSERT`: Adds new rows to a table. Note that SQL string literals use single quotes; double quotes denote identifiers.

```sql
INSERT INTO employees (first_name, last_name, email, department_id, salary)
VALUES
    ('Jane', 'Doe', 'jane.doe@example.com', 1, 80000),
    ('John', 'Smith', 'john.smith@example.com', 2, 95000);
```

- `UPDATE`: Modifies existing rows.

```sql
UPDATE employees
SET salary = salary * 1.05
WHERE department_id = 1;
```

- `DELETE`: Removes rows from a table.

```sql
DELETE FROM employees
WHERE is_active = FALSE;
```
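These statements can be exercised end to end from Python using the built-in `sqlite3` module; a minimal sketch with illustrative rows, using parameterized queries rather than string interpolation:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "CREATE TABLE employees (name TEXT, department_id INT, salary REAL, is_active BOOLEAN)"
)

# INSERT with placeholders -- never interpolate values into SQL strings
rows = [("Jane Doe", 1, 80000, 1), ("John Smith", 2, 95000, 0)]
cur.executemany("INSERT INTO employees VALUES (?, ?, ?, ?)", rows)

# UPDATE: 5% raise for department 1
cur.execute("UPDATE employees SET salary = salary * 1.05 WHERE department_id = 1")

# DELETE: remove inactive employees
cur.execute("DELETE FROM employees WHERE is_active = 0")

remaining = cur.execute("SELECT name, salary FROM employees").fetchall()
print(remaining)  # only Jane remains, with her raise applied
conn.close()
```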
Data Query Language (DQL)¶
DQL is used to retrieve data. This is where data engineers spend a significant amount of time.
- `SELECT`: The cornerstone of data retrieval. (Note the single quotes around the space literal; double quotes are reserved for identifiers.)

```sql
SELECT
    employee_id,
    first_name || ' ' || last_name AS full_name,
    salary
FROM employees
WHERE department_id = 2
ORDER BY salary DESC
LIMIT 10;
```

- `JOIN`: Combines rows from two or more tables. Common join types include `INNER JOIN`, `LEFT JOIN`, `RIGHT JOIN`, and `FULL OUTER JOIN`.

```sql
SELECT e.first_name, d.department_name
FROM employees e
INNER JOIN departments d ON e.department_id = d.department_id;
```

- `GROUP BY` and aggregate functions: Used for summarizing data.

```sql
SELECT
    d.department_name,
    COUNT(e.employee_id) AS number_of_employees,
    AVG(e.salary) AS average_salary
FROM employees e
JOIN departments d ON e.department_id = d.department_id
GROUP BY d.department_name
HAVING COUNT(e.employee_id) > 5;
```
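One distinction worth internalizing: `WHERE` filters rows before grouping, while `HAVING` filters groups after aggregation. A self-contained sketch using Python's built-in `sqlite3` module (the sample data is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE employees (name TEXT, dept TEXT, salary REAL)")
cur.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("a", "Eng", 100), ("b", "Eng", 120), ("c", "Eng", 110), ("d", "Sales", 90)],
)

# GROUP BY with aggregates; HAVING keeps only departments with more than
# two employees, so the single-person Sales group is filtered out.
result = cur.execute("""
    SELECT dept, COUNT(*) AS n, AVG(salary) AS avg_salary
    FROM employees
    GROUP BY dept
    HAVING COUNT(*) > 2
""").fetchall()
print(result)
conn.close()
```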
Advanced SQL: Window Functions¶
Window functions perform calculations across a set of table rows that are somehow related to the current row. They are incredibly powerful for analytics.
- `ROW_NUMBER()`, `RANK()`, `DENSE_RANK()`: Assign a rank to rows.

```sql
-- Find the top 3 highest-paid employees in each department
WITH RankedSalaries AS (
    SELECT
        e.first_name,
        d.department_name,
        e.salary,
        RANK() OVER (PARTITION BY d.department_name ORDER BY e.salary DESC) AS salary_rank
    FROM employees e
    JOIN departments d ON e.department_id = d.department_id
)
SELECT *
FROM RankedSalaries
WHERE salary_rank <= 3;
```

- `LAG()` and `LEAD()`: Access data from a previous or subsequent row.

```sql
-- Calculate month-over-month sales growth
SELECT
    sale_month,
    monthly_revenue,
    LAG(monthly_revenue, 1) OVER (ORDER BY sale_month) AS previous_month_revenue,
    (monthly_revenue - LAG(monthly_revenue, 1) OVER (ORDER BY sale_month))
        / LAG(monthly_revenue, 1) OVER (ORDER BY sale_month) AS growth_rate
FROM monthly_sales;
```
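`LAG()` can be tried directly in SQLite (version 3.25 or later, which is bundled with most modern Python builds); a minimal sketch with made-up monthly figures, showing that the first row has no previous value and so yields `NULL`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE sales (month TEXT, revenue REAL)")
cur.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("2024-01", 100.0), ("2024-02", 110.0), ("2024-03", 99.0)],
)

# LAG pulls the previous row's value within the window ordering;
# the first month has no predecessor, so prev_revenue is NULL (None).
rows = cur.execute("""
    SELECT month,
           revenue,
           LAG(revenue) OVER (ORDER BY month) AS prev_revenue
    FROM sales
""").fetchall()
print(rows)
conn.close()
```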
Common Table Expressions (CTEs)¶
CTEs (using the WITH clause) make complex queries more readable by breaking them down into logical, sequential steps.
```sql
WITH DepartmentSalaries AS (
    SELECT
        department_id,
        AVG(salary) AS avg_dept_salary
    FROM employees
    GROUP BY department_id
)
SELECT
    e.first_name,
    e.salary,
    ds.avg_dept_salary
FROM employees e
JOIN DepartmentSalaries ds ON e.department_id = ds.department_id
WHERE e.salary > ds.avg_dept_salary;
```

Part 2: Essential Python for Data Engineering¶
Python is the de facto language for data engineering due to its simplicity, extensive libraries, and strong community support.
Core Data Structures¶
A solid understanding of Python’s built-in data structures is essential.
- **Lists**: Ordered, mutable collections, e.g. `[1, 2, 3]`
- **Tuples**: Ordered, immutable collections, e.g. `(1, 2, 3)`
- **Dictionaries**: Key-value mappings, insertion-ordered since Python 3.7, e.g. `{"key": "value"}`
- **Sets**: Unordered collections of unique elements, e.g. `{1, 2, 3}`
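The practical differences show up quickly in pipeline code; a short sketch of each structure's defining behavior:

```python
# Lists are mutable; tuples are not
nums = [1, 2, 3]
nums.append(4)            # in-place mutation is fine
point = (1, 2, 3)
try:
    point[0] = 9          # tuples reject item assignment
except TypeError as exc:
    error = str(exc)

# Dictionaries map keys to values and preserve insertion order (Python 3.7+)
config = {"host": "localhost", "port": 5432}
config["user"] = "dataeng"

# Sets keep only unique elements -- handy for de-duplicating IDs
ids = {1, 2, 2, 3, 3, 3}

print(nums, list(config), sorted(ids))
```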
List Comprehensions and Generators¶
These are idiomatic and efficient ways to create lists and iterators.
```python
# List comprehension to get squares of even numbers
even_squares = [x**2 for x in range(10) if x % 2 == 0]

# Generator expression for memory-efficient iteration
even_squares_gen = (x**2 for x in range(10) if x % 2 == 0)
```

Functions and Decorators¶
Functions are the building blocks of any application. Decorators are a powerful way to modify or enhance functions.
```python
import functools
import time

def timing_decorator(func):
    """A decorator to measure the execution time of a function."""
    @functools.wraps(func)  # preserve the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        start_time = time.time()
        result = func(*args, **kwargs)
        end_time = time.time()
        print(f"Function {func.__name__} took {end_time - start_time:.4f} seconds")
        return result
    return wrapper

@timing_decorator
def process_large_file(file_path):
    # Simulate processing
    time.sleep(2)
    print(f"Processed {file_path}")

process_large_file("/path/to/data.csv")
```

File Handling and Context Managers¶
Data engineers constantly read from and write to files. The `with` statement ensures that files are properly closed, even if an exception is raised.
```python
# Reading a file line by line
try:
    with open("data.csv", "r") as f:
        for line in f:
            print(line.strip())
except FileNotFoundError:
    print("File not found.")

# Writing to a file
with open("output.txt", "w") as f:
    f.write("Hello, Data Engineering!\n")
```

Key Data Engineering Libraries¶
- **Pandas**: The go-to library for in-memory data manipulation and analysis. It provides a powerful `DataFrame` object.

```python
import pandas as pd

# Read CSV into a DataFrame
df = pd.read_csv("sales_data.csv")

# Basic data cleaning (assignment is preferred over the deprecated inplace=True)
df["order_date"] = pd.to_datetime(df["order_date"])
df["revenue"] = df["revenue"].fillna(0)

# Group by and aggregate
monthly_sales = df.groupby(df["order_date"].dt.to_period("M"))["revenue"].sum()
print(monthly_sales)
```

- **Requests**: The de facto standard for making HTTP requests, essential for interacting with APIs.

```python
import requests

response = requests.get("https://api.github.com/users/google")
if response.status_code == 200:
    data = response.json()
    print(f"Google has {data['public_repos']} public repositories.")
```

- **Psycopg2 / PyMySQL**: Database drivers for connecting to PostgreSQL and MySQL.

```python
import psycopg2

conn = psycopg2.connect(
    host="localhost",
    database="dataeng_db",
    user="dataeng",
    password="dataeng123"
)
cursor = conn.cursor()
cursor.execute("SELECT version();")
print(cursor.fetchone())
cursor.close()
conn.close()
```

- **PySpark**: The Python API for Apache Spark, used for large-scale distributed data processing.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SimpleApp").getOrCreate()

log_data = spark.read.text("README.md").cache()
num_as = log_data.filter(log_data.value.contains("a")).count()
num_bs = log_data.filter(log_data.value.contains("b")).count()
print(f"Lines with a: {num_as}, lines with b: {num_bs}")

spark.stop()
```
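Both psycopg2 and PyMySQL implement the same DB-API 2.0 pattern (connect, cursor, execute, fetch, close), which Python's built-in `sqlite3` module follows as well; a dependency-free sketch of that pattern, with an illustrative table:

```python
import sqlite3

# Same DB-API 2.0 flow as psycopg2/PyMySQL: connect -> cursor -> execute -> fetch -> close
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row   # rows allow access by column name, like a dict
cur = conn.cursor()

cur.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, kind TEXT)")
cur.execute("INSERT INTO events (kind) VALUES (?)", ("page_view",))
conn.commit()

row = cur.execute("SELECT id, kind FROM events").fetchone()
print(dict(row))
conn.close()
```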
Virtual Environments¶
Always use virtual environments to manage project dependencies and avoid conflicts. `venv` is built into Python.

Create an environment:

```shell
python3 -m venv my_project_env
```

Activate it:

- macOS/Linux: `source my_project_env/bin/activate`
- Windows: `my_project_env\Scripts\activate`

Install packages:

```shell
pip install pandas requests
```

Save dependencies:

```shell
pip freeze > requirements.txt
```
This primer covers the foundational skills in SQL and Python that you will use daily as a data engineer. Mastering these concepts will provide a solid base for tackling the more advanced topics covered in this book.