Chapter 22: Case Study: Fraud Detection in Financial Services

Financial fraud is one of the clearest examples of why data engineering must be designed as an operational discipline, not only as a reporting function. A payment company cannot wait until tomorrow morning to discover that a criminal used stolen credentials at midnight. The platform must capture the event, enrich it with trustworthy context, score it, explain the decision, and feed the outcome back into future models while the customer is still waiting at the checkout page.

This final case study follows a fictional payment processor called SecurePay. SecurePay authorizes card-present, e-commerce, wallet, and API transactions for merchants across several countries. Its executive team has one business goal: approve legitimate customers quickly while stopping high-risk transactions before settlement. The problem is strategically important because fraud losses are not abstract. The U.S. Federal Trade Commission reported that consumers lost more than $12.5 billion to fraud in 2024, a 25% increase over the previous year, and the share of fraud reports involving a monetary loss rose from 27% in 2023 to 38% in 2024 [1].

Figure 1: Chapter overview for the SecurePay fraud-detection case study. The editable source is stored beside the SVG as shared/assets/figures/ch22/01_chapter_overview.drawio.

Fraud detection is a demanding capstone because it combines most ideas from this book. It needs event streaming, low-latency serving, batch history, feature stores, model training, observability, governance, cost trade-offs, and a practical lab. It is also an excellent reminder that machine learning succeeds only when the data platform is reliable. A clever model cannot compensate for missing events, leaking features, stale online values, or delayed labels that never return to the training pipeline.

Learning Objectives

By the end of this chapter, you should be able to translate fraud-detection business requirements into production data architecture. You should also be able to explain why fraud platforms require both streaming features and historical features, why class imbalance changes evaluation metrics, how feature stores reduce training-serving skew, and how feedback loops turn investigator decisions and chargebacks into future training data.

| Capability | What you should be able to do |
| --- | --- |
| Business translation | Convert fraud-loss, customer-friction, and audit requirements into latency, availability, security, and explainability requirements. |
| Architecture design | Design a streaming-first platform using Kafka-style event streams, Flink-style stateful feature computation, online feature serving, model inference, and human review. |
| Feature engineering | Separate point-in-time historical features from online velocity features and explain why this separation prevents leakage. |
| Evaluation | Interpret precision, recall, false positives, false negatives, review volume, and business cost instead of relying on misleading accuracy. |
| Governance | Explain how model versions, rule versions, reason codes, label lineage, and investigator outcomes make fraud decisions auditable. |
| Lab execution | Run the shared Chapter 22 project and inspect a generated fraud-quality report. |

22.1 Business Goals and Requirements

SecurePay serves merchants that expect authorization decisions in real time. A payment authorization is different from a batch analytics query because the decision is part of the customer experience. If the platform is slow, the merchant loses conversion. If the platform is too aggressive, good customers are embarrassed or blocked. If the platform is too permissive, SecurePay absorbs chargebacks, regulatory scrutiny, and reputational damage.

The useful way to structure the problem is to connect a use case, a pain point, and a solution. The use case is real-time transaction risk scoring for a multi-channel payment processor. The pain point is that fraudsters adapt faster than static rules, while manual review teams cannot investigate every suspicious event. The solution is a data platform that combines deterministic controls, learned risk scores, and a feedback loop that continuously converts outcomes into better features and models.

| Requirement | Production meaning | Data engineering implication |
| --- | --- | --- |
| Low-latency authorization | The platform must return a decision before the authorization window expires. | Keep the critical path small: event normalization, feature lookup, model score, and decision rules. |
| High throughput | Traffic spikes during holidays, salary days, campaigns, merchant promotions, or fraud attacks. | Partition event streams, scale stateless services horizontally, and size stateful stores for peak load. |
| High recall with controlled friction | Missed fraud creates direct financial loss, but false positives damage customer trust. | Evaluate with precision, recall, false-positive rate, false-negative rate, review volume, and cost-weighted metrics. |
| Explainability | Analysts, merchants, compliance teams, and customers may ask why a transaction was declined. | Store decision inputs, model version, rule version, feature values, threshold, action, and reason codes. |
| Security and privacy | Payment data is sensitive and may include regulated personal information. | Apply encryption, access control, masking, audit logs, data minimization, and retention policies. |
| Learning feedback | Chargebacks and investigator labels arrive after the original decision. | Build durable label pipelines and point-in-time joins for retraining. |

A common mistake is to describe this as “just a classification problem.” It is better described as a decisioning system. The classifier estimates risk, but the business action is chosen by a policy layer that considers score thresholds, merchant category, transaction amount, customer profile, authentication options, and investigator capacity.

A fraud platform is successful only when its decision policy is aligned with its data architecture. The risk score may be statistical, but the approve, decline, and manual-review actions are business controls.

22.2 The Challenges of Fraud Detection

Fraud detection is difficult because the data distribution is adversarial. Fraudsters observe defenses and change behavior. Legitimate customers also change behavior when they travel, receive salary, buy expensive devices, or switch merchants. The platform must therefore distinguish suspicious novelty from normal life.

The first challenge is class imbalance. The well-known anonymized European credit-card fraud dataset contains 492 frauds among 284,807 transactions, which means fraud represents only 0.172% of all transactions [2]. A naive model that predicts “not fraud” for every transaction would appear extremely accurate, yet it would be useless. This is why fraud teams care about precision-recall behavior, alert volume, investigator capacity, and cost-weighted outcomes.
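The arithmetic behind that claim can be checked directly. The following is a minimal sketch using only the two counts cited above; the function name `always_legit_metrics` is a hypothetical helper introduced for illustration.

```python
# Counts from the dataset cited above: 492 frauds among 284,807 transactions.
TOTAL = 284_807
FRAUD = 492


def always_legit_metrics(total: int, fraud: int) -> dict:
    """Metrics for a baseline that predicts "not fraud" for every transaction."""
    true_positives = 0               # the baseline never flags anything
    true_negatives = total - fraud   # every legitimate transaction counts as correct
    false_negatives = fraud          # every fraud case is missed
    accuracy = (true_positives + true_negatives) / total
    recall = true_positives / (true_positives + false_negatives)
    return {"accuracy": accuracy, "recall": recall, "missed_fraud": false_negatives}


metrics = always_legit_metrics(TOTAL, FRAUD)
print(f"accuracy = {metrics['accuracy']:.2%}")   # roughly 99.83%
print(f"recall   = {metrics['recall']:.0%}")     # 0%: no fraud is caught
```

The baseline scores about 99.83% accuracy while catching none of the 492 fraud cases, which is why accuracy alone says almost nothing about a fraud model.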

The second challenge is concept drift. Fraud patterns change as criminals test new merchants, devices, countries, mule accounts, synthetic identities, and transaction sizes. Research on credit-card fraud with delayed supervised information identifies concept drift, class imbalance, and delayed labels as central practical challenges [3]. The engineering consequence is that the model-training dataset must preserve time, label delay, model version, and original decision context.

The third challenge is feature availability. Some features are naturally historical, such as average customer transaction amount over the last 90 days. Some are streaming features, such as the number of transactions from the same card in the last 30 minutes. Some are external signals, such as device reputation or merchant risk. The platform must know exactly which features were available at authorization time. If a training job accidentally uses a future chargeback label or a post-event balance, the offline model will look strong but fail in production.

| Challenge | Example | Engineering response |
| --- | --- | --- |
| Class imbalance | Fraud is rare compared with legitimate transactions. | Use precision-recall evaluation, stratified sampling, threshold tuning, and cost-sensitive analysis. |
| Concept drift | A new fraud ring targets wallet transactions from a reused device cluster. | Monitor feature distributions, alert precision, chargeback rates, and model-score drift. |
| Delayed labels | Chargebacks may arrive days or weeks after authorization. | Maintain a label store and retraining snapshots that respect event time. |
| Low-latency constraints | Authorization cannot wait for a complex warehouse query. | Pre-compute historical features and serve them from an online store. |
| Adversarial behavior | Fraudsters test thresholds using small transactions. | Combine model scores with rules, velocity counters, step-up authentication, and case investigation. |
| Regulatory and audit expectations | A declined transaction may need an explanation. | Preserve lineage for input event, feature values, model version, rule version, and reason codes. |

The key mental model is that fraud detection is not a single model. It is a closed-loop data product: events become features, features become decisions, decisions become labels, and labels become better future models.

22.3 Architecture and Technology Choices

SecurePay uses a streaming-first architecture. The critical path is intentionally narrow: receive the authorization event, retrieve online features, compute short-window features, call the model endpoint, apply policy rules, and return a decision. Heavy joins, historical aggregation, exploratory analysis, and retraining are moved out of this path.

Apache Kafka is a natural fit for the event backbone because Kafka is designed for publishing, subscribing to, storing, and processing streams of events [4]. Apache Flink is a natural fit for stateful feature computation because stream processing continuously handles unbounded data as events arrive, and Flink’s model emphasizes event time, stateful operators, and fault-tolerant snapshots [5]. A feature store such as Feast separates historical feature extraction from low-latency online serving; Feast describes offline stores for historical training data and online stores for production feature retrieval [6].

Figure 2: Reference architecture for SecurePay real-time fraud detection. The editable source is shared/assets/figures/ch22/02_fraud_detection_reference_architecture.drawio.

The architecture has two speeds. The fast path handles authorization and must remain small enough to meet latency requirements. The learning path handles labels, analysis, retraining, evaluation, model release, and governance. This separation is critical. If the team adds too much logic to the fast path, customer experience suffers. If the team neglects the learning path, model quality decays.

| Layer | SecurePay component | Responsibility |
| --- | --- | --- |
| Ingestion | Kafka topics or equivalent event streams | Store transaction events, decisions, alerts, labels, and chargebacks as durable streams. |
| Streaming features | Flink jobs or equivalent stream processors | Compute velocity counters, enrich events, perform online lookups, and emit decision-ready feature vectors. |
| Feature store | Feast registry, offline store, and online store | Define feature contracts, materialize historical features, and serve current values with low latency. |
| Online store | Redis, Tair, DynamoDB, or another low-latency key-value store | Provide fast access to customer, card, merchant, device, and account features. |
| Model serving | Model endpoint such as PAI-EAS, KServe, Seldon, or a managed equivalent | Return a calibrated fraud-risk score and model metadata. |
| Decisioning | Decision API | Combine risk score, rules, thresholds, authentication options, and manual-review routing. |
| Feedback | Investigation workbench and label streams | Capture analyst decisions, customer disputes, chargebacks, and confirmed false positives. |
| Offline learning | Warehouse, lakehouse, or training pipeline | Create point-in-time snapshots, train models, validate thresholds, and release governed model versions. |

The technology names are less important than the boundaries. Kafka can be replaced by another durable event platform. Flink can be replaced by another stateful stream processor. Redis or Tair can be replaced by another online key-value store. The principle remains the same: do not ask a data warehouse to answer a millisecond authorization request, and do not train a model from features that differ from the features available online.

22.4 Key Implementation Details

Feature Stores and Point-in-Time Correctness

The feature store is the contract between data engineering and machine learning. It answers three questions. First, what does each feature mean? Second, how is it computed for historical training data? Third, how is the same feature served online during inference? Without that contract, teams often develop two pipelines: one SQL-heavy training pipeline and one application-specific serving pipeline. The result is training-serving skew.

Figure 3: Feature engineering pipeline showing how historical and streaming signals become offline and online feature products. The editable source is shared/assets/figures/ch22/03_feature_engineering_pipeline.drawio.

SecurePay starts with a small but powerful feature set. These features are intentionally explainable because the first version of a fraud platform should be observable before it becomes sophisticated.

| Feature family | Example feature | Computation mode | Why it helps |
| --- | --- | --- | --- |
| Customer baseline | Average customer amount over prior transactions | Batch/offline snapshot | Large deviations from normal spending may indicate account takeover. |
| Velocity | Number of card transactions in the last 30 minutes | Streaming state | Fraud attacks often generate bursts before the card is blocked. |
| Device reputation | Number of prior customers seen on the same device | Streaming and batch hybrid | Many customers on one device can indicate credential stuffing or mule activity. |
| Merchant risk | Prior fraud or dispute rate for merchant | Batch/offline snapshot | Some merchants or merchant categories carry higher risk. |
| Geography | Home-country and merchant-country mismatch | Event enrichment | Cross-border usage may be normal, but sudden mismatch increases risk. |
| Channel risk | E-commerce or API transaction indicator | Event enrichment | Card-not-present channels often have different risk characteristics from in-person transactions. |
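
The velocity feature in the table can be sketched as a per-card sliding window. This is a minimal in-memory illustration, not how a stream processor such as Flink implements keyed state: the class name `VelocityCounter` and its method are hypothetical, timestamps are plain epoch seconds, and events are assumed to arrive in event-time order for each card (a real job would need watermarks or similar handling for late events).

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 30 * 60  # the 30-minute window from the feature table


class VelocityCounter:
    """Counts prior transactions per card inside a sliding time window."""

    def __init__(self, window_seconds: int = WINDOW_SECONDS):
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # card_id -> timestamps still in window

    def observe(self, card_id: str, event_ts: int) -> int:
        """Return how many earlier transactions fall in the window, then record this one."""
        window = self._events[card_id]
        cutoff = event_ts - self.window_seconds
        # Evict timestamps that are 30 minutes old or older.
        while window and window[0] <= cutoff:
            window.popleft()
        prior_count = len(window)  # excludes the current event, so no self-leakage
        window.append(event_ts)
        return prior_count


counter = VelocityCounter()
for ts in (0, 600, 1200, 1900):
    print(ts, counter.observe("card-1", ts))
```

Returning the count before appending the current event keeps the feature limited to information that existed strictly before the transaction being scored.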

The most important implementation rule is point-in-time correctness. A training row for a transaction at 10:00 must use only information available at 10:00. If the feature pipeline uses labels, chargebacks, or future transactions from 10:05, it leaks information. Leakage is dangerous because it creates false confidence: validation looks excellent while production performance disappoints.
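
A point-in-time lookup can be sketched with a sorted history and a binary search. The example below is an illustration under stated assumptions: the function `point_in_time_value`, the feature history, and its values are hypothetical, and a feature store would normally perform this join for you.

```python
from bisect import bisect_right
from datetime import datetime


def point_in_time_value(snapshots, event_time):
    """Return the latest feature value whose effective time is <= event_time.

    `snapshots` is a list of (effective_time, value) tuples sorted by time.
    Returns None when no value existed yet at event_time.
    """
    effective_times = [ts for ts, _ in snapshots]
    idx = bisect_right(effective_times, event_time)
    return snapshots[idx - 1][1] if idx > 0 else None


# Hypothetical history of one customer's average transaction amount.
avg_amount_history = [
    (datetime(2024, 1, 1, 9, 0), 42.0),
    (datetime(2024, 1, 1, 10, 5), 310.0),  # computed after the 10:00 event
]

training_value = point_in_time_value(avg_amount_history, datetime(2024, 1, 1, 10, 0))
print(training_value)  # 42.0, not the value that only became available at 10:05
```

Whether a value stamped exactly at the event time counts as available is a policy choice; this sketch includes it, and a stricter pipeline might require the effective time to be strictly earlier.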

Decisioning Is More Than a Score

The model score estimates risk, but the decision API chooses an action. For a low-risk transaction, SecurePay may approve and monitor. For a medium-risk transaction, it may request step-up authentication. For a high-risk transaction, it may decline immediately. For a transaction with uncertain model output but high business value, it may route the case to manual review.

This policy layer makes the platform safer. It allows business teams to change thresholds without retraining the model, compliance teams to enforce hard rules, and engineering teams to test model versions in shadow mode before full release. Every decision should store the model version, rule version, feature vector identifier, risk score, threshold, action, and reason codes.

| Action | When it is appropriate | Data engineering requirement |
| --- | --- | --- |
| Approve | Low risk and normal customer behavior. | Persist the decision and score for future learning. |
| Step-up authentication | Medium risk where identity can be challenged. | Integrate authentication outcome back into the label and decision streams. |
| Manual review | High uncertainty or high-value transaction. | Write a review case with reason codes and investigator-facing context. |
| Decline | High risk with strong policy or model evidence. | Preserve explainable decision inputs for audit, appeal, and monitoring. |
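
One way to picture the policy layer is as a small function that sits between the model score and the action. The thresholds, the high-value cutoff, and the reason-code strings below are all hypothetical values chosen for illustration; they are not SecurePay's actual policy.

```python
from dataclasses import dataclass

# Hypothetical policy parameters; real values would be versioned business configuration.
DECLINE_THRESHOLD = 0.90
STEP_UP_THRESHOLD = 0.60
UNCERTAIN_THRESHOLD = 0.40
HIGH_VALUE_AMOUNT = 1_000.00


@dataclass(frozen=True)
class Decision:
    action: str          # "approve", "step_up", "manual_review", or "decline"
    reason_codes: tuple  # explanations stored alongside the decision


def decide(score: float, amount: float) -> Decision:
    """Map a model risk score and transaction amount to a business action."""
    if score >= DECLINE_THRESHOLD:
        return Decision("decline", ("SCORE_ABOVE_DECLINE_THRESHOLD",))
    if score >= STEP_UP_THRESHOLD:
        return Decision("step_up", ("SCORE_ABOVE_STEP_UP_THRESHOLD",))
    if score >= UNCERTAIN_THRESHOLD and amount >= HIGH_VALUE_AMOUNT:
        # Uncertain score on a high-value transaction: let a human look at it.
        return Decision("manual_review", ("UNCERTAIN_SCORE", "HIGH_VALUE"))
    return Decision("approve", ())


print(decide(0.95, 50.0))
print(decide(0.45, 2_500.0))
```

Because the thresholds live outside the model, changing them alters the action mix without retraining, which is the separation the policy layer is meant to provide.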

The Feedback Loop: The Most Important Pipeline

The feedback loop is the learning engine of the platform. It receives manual-review outcomes, confirmed customer disputes, chargebacks, merchant appeals, authentication results, and false-positive complaints. These labels are delayed and sometimes contradictory. An analyst may mark a case suspicious, the merchant may later provide evidence, and the final chargeback outcome may arrive weeks later.

SecurePay therefore treats labels as versioned facts rather than as simple columns. The label stream records who created the label, when it was created, which transaction it refers to, what evidence supported it, and whether it supersedes an earlier label. This allows the training pipeline to produce reproducible snapshots.
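
A versioned label can be sketched as an immutable record plus a function that resolves which label was active at a given snapshot time. The record fields mirror the attributes listed above, but the names `LabelRecord` and `labels_as_of`, the epoch-second timestamps, and the sample data are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LabelRecord:
    label_id: str
    transaction_id: str
    is_fraud: bool
    source: str                       # e.g. "analyst" or "chargeback"
    created_at: int                   # epoch seconds (assumed representation)
    supersedes: Optional[str] = None  # label_id of the earlier label, if any


def labels_as_of(records, snapshot_ts):
    """Return {transaction_id: LabelRecord} for labels active at snapshot_ts."""
    visible = [r for r in records if r.created_at <= snapshot_ts]
    superseded = {r.supersedes for r in visible if r.supersedes}
    active = {}
    # Later labels overwrite earlier ones for the same transaction.
    for record in sorted(visible, key=lambda r: r.created_at):
        if record.label_id not in superseded:
            active[record.transaction_id] = record
    return active


records = [
    LabelRecord("L1", "txn-1", True, "analyst", created_at=100),
    LabelRecord("L2", "txn-1", False, "chargeback", created_at=900, supersedes="L1"),
]
print(labels_as_of(records, 500)["txn-1"].is_fraud)   # True: only the analyst label existed
print(labels_as_of(records, 1000)["txn-1"].is_fraud)  # False: the chargeback outcome supersedes it
```

Because nothing is updated in place, rerunning `labels_as_of` with the same snapshot time yields the same training labels, which is what makes a snapshot reproducible.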

Figure 4: Model training, monitoring, and retraining loop for delayed labels and concept drift. The editable source is shared/assets/figures/ch22/04_model_training_retraining_loop.drawio.

The retraining policy should not be purely calendar based. A weekly or monthly cadence is useful, but fraud teams should also retrain or recalibrate when monitoring shows score drift, declining alert precision, rising chargeback rate, or new fraud patterns in analyst notes. Whatever the trigger, every new model should pass governance checks before it replaces the current production version.

| Monitoring signal | What it detects | Possible action |
| --- | --- | --- |
| Alert precision | Whether investigators are seeing too many false alerts. | Adjust thresholds, improve features, or retrain the model. |
| False-positive complaints | Whether legitimate customers are blocked too often. | Add step-up authentication, merchant-specific rules, or threshold exceptions. |
| Chargeback rate | Whether fraud is escaping detection. | Investigate missed segments and update features or rules. |
| Feature drift | Whether input distributions changed. | Compare online feature distributions against training baselines. |
| Model-score drift | Whether score distribution changed without a business explanation. | Trigger model review and calibration analysis. |
| Label delay | Whether the training set is missing recent outcomes. | Delay retraining snapshots or add delayed-label correction logic. |
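
The chapter does not prescribe a drift statistic. One common choice for comparing an online feature distribution against its training baseline is the Population Stability Index (PSI), sketched below as an assumption rather than as SecurePay's method. The bin edges and sample values are hypothetical, and both inputs are assumed to be non-empty.

```python
import math


def population_stability_index(expected, actual, bin_edges, eps=1e-6):
    """PSI between a baseline sample and a current sample over fixed bins."""

    def proportions(values):
        counts = [0] * (len(bin_edges) + 1)
        for value in values:
            # Bin index = number of edges the value has reached or passed.
            idx = sum(1 for edge in bin_edges if value >= edge)
            counts[idx] += 1
        total = len(values)
        # Floor each share at eps so empty bins do not produce log(0).
        return [max(count / total, eps) for count in counts]

    p_expected = proportions(expected)
    p_actual = proportions(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(p_expected, p_actual))


# Hypothetical transaction amounts: training baseline versus current online traffic.
baseline = [5, 12, 18, 25, 40, 60, 75, 90, 150, 300]
current = [amount * 3 for amount in baseline]  # simulated upward shift
edges = [20, 50, 100]

print(population_stability_index(baseline, baseline, edges))  # 0.0 for identical samples
print(population_stability_index(baseline, current, edges))   # positive when the mix shifts
```

The value at which a PSI score should trigger a review is a team decision, best calibrated against past incidents rather than taken from a fixed rule.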

22.5 Production Design Pattern: Fast Path and Learning Path

The production design pattern for SecurePay is a dual-path fraud platform. The fast path protects the authorization experience. The learning path protects long-term model quality and auditability. Both paths must share identifiers, timestamps, feature definitions, and model metadata so that decisions can be replayed.

| Design decision | Recommended approach | Reason |
| --- | --- | --- |
| Event identity | Assign a globally unique transaction ID at ingestion. | Every feature, score, decision, label, and investigation note must join back to the original event. |
| Time semantics | Use event time for feature computation and processing time for operational monitoring. | Fraud systems need both historical correctness and real-time availability metrics. |
| Feature definitions | Store feature contracts in version control and a registry. | Training, serving, monitoring, and investigation must agree on feature meaning. |
| Model release | Promote models through offline evaluation, shadow scoring, canary release, and monitored rollout. | A strong offline model can still fail under real traffic, latency, or segment drift. |
| Decision logging | Log score, model version, rule version, threshold, action, and reason codes. | Compliance, appeals, model-risk management, and incident response require reproducible decisions. |
| Label governance | Treat labels as delayed, versioned, and source-attributed records. | Chargebacks, investigator judgments, and customer disputes may arrive at different times and disagree. |
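
A decision log entry that supports replay might look like the sketch below. The field list follows the decision-logging row above; the class name, the JSON-lines encoding, the sample values, and the `replay_matches` helper are hypothetical choices made for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DecisionLogRecord:
    transaction_id: str
    event_time: str          # ISO 8601 event time, not processing time
    feature_vector_id: str   # pointer to the exact features the model saw
    model_version: str
    rule_version: str
    risk_score: float
    threshold: float
    action: str
    reason_codes: tuple


def to_log_line(record: DecisionLogRecord) -> str:
    """Serialize one decision as a single JSON line with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)


def replay_matches(log_line: str, rescored: float, tolerance: float = 1e-9) -> bool:
    """Check that re-scoring the stored feature vector reproduces the logged score."""
    logged = json.loads(log_line)
    return abs(logged["risk_score"] - rescored) <= tolerance


record = DecisionLogRecord(
    transaction_id="txn-0001",
    event_time="2024-01-01T10:00:00Z",
    feature_vector_id="fv-0001",
    model_version="fraud-model-v3",
    rule_version="rules-2024-01",
    risk_score=0.93,
    threshold=0.90,
    action="decline",
    reason_codes=("SCORE_ABOVE_DECLINE_THRESHOLD",),
)
line = to_log_line(record)
print(line)
```

Storing a feature vector identifier instead of the raw features keeps the log line small, but it only supports replay if the referenced vector is itself retained immutably.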

This design pattern generalizes beyond payment fraud. Similar systems appear in account takeover detection, anti-money-laundering alerting, insurance claim triage, marketplace trust and safety, and platform abuse detection. The common engineering problem is to make a fast, explainable decision now while preserving enough evidence to learn later.

22.6 Guided Lab: Run the Shared SecurePay Project

The runnable case-study project is stored in the shared folder so that all language editions of the book can reuse it:

shared/projects/ch22_fraud_detection/

The lab uses only the Python standard library. It generates synthetic authorization events, builds point-in-time style features, applies a transparent baseline risk model, and writes a quality report. The goal is not to create a production-grade fraud model. The goal is to make the chapter concrete: readers can see how raw events become features, decisions, and operational metrics.

Figure 5: Shared project architecture for the runnable Chapter 22 fraud-detection lab. The editable source is shared/assets/figures/ch22/05_fraud_case_study_project_architecture.drawio.

Run the lab from the repository root:

cd shared/projects/ch22_fraud_detection
python3 scripts/01_generate_transactions.py
python3 scripts/02_build_features.py
python3 scripts/03_score_transactions.py
python3 scripts/04_quality_report.py

A successful run writes data/bronze/transactions.csv, data/silver/transaction_features.csv, data/gold/fraud_decisions.csv, data/gold/confusion_matrix.csv, and reports/fraud_quality_report.md. The project README provides the full workflow, expected outputs, cleanup guidance, and troubleshooting notes. Extension exercises are available at shared/projects/ch22_fraud_detection/exercises/README.md, and the instructor solution is available at shared/solutions/ch22_fraud_detection/solution.md.

| Script | Output | Engineering concept |
| --- | --- | --- |
| 01_generate_transactions.py | Raw authorization-like events with delayed labels. | Bronze event stream and class imbalance. |
| 02_build_features.py | Point-in-time feature rows. | Feature engineering, velocity windows, and leakage prevention. |
| 03_score_transactions.py | Risk scores, decisions, and confusion matrix. | Model serving abstraction and policy thresholds. |
| 04_quality_report.py | Markdown quality report. | Operational evaluation and business-cost interpretation. |

After running the lab, open reports/fraud_quality_report.md. Then edit RISK_THRESHOLD and REVIEW_THRESHOLD in scripts/03_score_transactions.py. Rerun the scoring and reporting scripts. You should see the same trade-off that production fraud teams face: stricter settings that flag more transactions may catch more fraud, but they also increase customer friction and review workload.
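
The trade-off can also be seen on a toy example that is independent of the lab scripts. The sketch below does not reproduce the lab's code: the scored rows, the `evaluate` helper, and the three thresholds are hypothetical, and a transaction is treated as flagged when its score is at or above the threshold.

```python
def evaluate(scored, threshold):
    """Precision, recall, and flagged volume when flagging scores >= threshold."""
    tp = sum(1 for score, is_fraud in scored if score >= threshold and is_fraud)
    fp = sum(1 for score, is_fraud in scored if score >= threshold and not is_fraud)
    fn = sum(1 for score, is_fraud in scored if score < threshold and is_fraud)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"threshold": threshold, "flagged": tp + fp,
            "precision": precision, "recall": recall}


# Hypothetical (risk_score, is_fraud) pairs; three of the eight rows are fraud.
SCORED = [
    (0.95, True), (0.80, True), (0.70, False), (0.55, True),
    (0.50, False), (0.30, False), (0.20, False), (0.10, False),
]

for threshold in (0.9, 0.6, 0.4):
    result = evaluate(SCORED, threshold)
    print(f"threshold={threshold:.1f} flagged={result['flagged']} "
          f"precision={result['precision']:.2f} recall={result['recall']:.2f}")
```

On this toy data, lowering the threshold from 0.9 to 0.4 raises recall from one in three fraud cases to all three, while the flagged volume grows from one transaction to five and precision falls from 1.00 to 0.60.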

Common Pitfalls

The first pitfall is optimizing for accuracy. In an imbalanced fraud dataset, accuracy can be high even when the model misses nearly every fraud case. Use precision, recall, false-positive rate, false-negative rate, review volume, and business cost.

The second pitfall is allowing feature leakage. If the offline training dataset uses information that was unavailable at authorization time, the model will fail silently in production. Always build training rows with event-time semantics and auditable feature definitions.

The third pitfall is treating the model as the whole system. Fraud detection requires ingestion, identity resolution, feature computation, online serving, model governance, analyst tooling, monitoring, and feedback. A strong model without reliable pipelines is not a production control.

The fourth pitfall is ignoring human operations. Manual-review teams have finite capacity. If alert volume doubles without better precision, analysts become overloaded, investigation quality drops, and confirmed labels become slower and noisier.

The fifth pitfall is releasing models without decision replay. If the team cannot reconstruct what the model saw, which version scored it, which threshold was active, and why the policy selected an action, the organization cannot investigate incidents or satisfy audit expectations.

Exercises

  1. Change RISK_THRESHOLD and REVIEW_THRESHOLD in the shared project, then compare precision, recall, manual-review volume, and estimated operating cost across three configurations.

  2. Add a new feature called same_device_many_customers to 02_build_features.py, then update the scoring script to use it and document whether it changes false positives or false negatives.

  3. Extend the quality report so that it shows metrics by transaction channel: pos, ecommerce, wallet, and api.

  4. Design a Kafka topic plan for SecurePay. Include topics for authorization events, decisions, manual-review cases, confirmed labels, model-monitoring metrics, and model-release events.

  5. Write an architecture decision record explaining whether SecurePay should decline high-risk transactions directly, route them to step-up authentication, or send them to manual review.

  6. Propose a governance checklist for promoting a new fraud model from offline training to production. Include validation data, segment performance, drift monitoring, rollback, and approval evidence.

Review Questions

| Question | What a strong answer should include |
| --- | --- |
| Why is accuracy a misleading metric for fraud detection? | Fraud is rare, so a model can look accurate while missing most fraud; precision, recall, cost, and review volume are more meaningful. |
| What is point-in-time correctness? | Each training or scoring row uses only information available at the transaction event time. |
| Why does a fraud platform need both a fast path and a learning path? | The fast path protects latency and customer experience; the learning path handles labels, retraining, evaluation, and governance. |
| How does a feature store reduce training-serving skew? | It defines common feature contracts and separates offline historical retrieval from online low-latency serving while preserving semantic consistency. |
| What should be logged for each decision? | Transaction ID, event time, features or feature vector ID, model version, rule version, risk score, threshold, action, and reason codes. |
| Why are fraud labels difficult to use? | They are delayed, sometimes contradictory, and may arrive from multiple sources such as chargebacks, investigators, authentication outcomes, and customer disputes. |

Chapter Summary

This chapter used SecurePay to connect the full data engineering lifecycle to a high-stakes business problem. Fraud detection is valuable because it forces us to design for low latency, high throughput, feature correctness, delayed labels, model governance, and measurable business impact. The architecture separates the fast authorization path from the slower learning path, uses a feature store to reduce training-serving skew, and treats feedback as a first-class data product.

The most important lesson is that real-time AI systems are not only about models. They are about data contracts, streaming state, online serving, reliable labels, monitoring, and human accountability. If you can reason through this case study, you can reason through many modern data products: recommendation engines, personalization systems, risk platforms, customer 360 systems, trust-and-safety systems, and AI search applications.

References