# Chapter 19: Feature Stores and Model Serving

In this chapter, you will design the production interface between machine learning and data engineering: a feature store that supplies point-in-time-correct features and a model-serving layer that turns trained models into monitored, versioned, low-latency services. By the end, you will be able to explain why strong offline metrics are not enough, build a small feature-serving contract, and validate whether online inference uses the same feature logic that was used during training.

## Opening Scenario

TuranMart has trained a fraud-risk model on historical transaction data. In the experiment notebook, the model looks excellent: the validation AUC is high, precision at the manual-review threshold is acceptable, and the fraud operations team is eager to use the model before a seasonal sales campaign. The first production rollout, however, creates confusion. Some clearly risky customers receive low scores, while some normal customers are sent to manual review. The model artifact did not change, so the first suspicion is that the serving system is wrong.

The incident review reveals a deeper data-engineering problem. The training job used a warehouse query that joined each transaction to hourly customer behavior and daily merchant-risk aggregates. The online service, by contrast, looked up a Redis key that was materialized by a different job, on a different schedule, with different default values for missing features. The training query also joined to the most recent feature row before the transaction timestamp, but it ignored the ten-minute delay between a transaction event and the moment the upstream aggregation table became complete. The validation dataset therefore contained some feature values that would not have been available at production scoring time.

The stakeholders now need a safer design. The data engineering team wants reusable feature definitions rather than copy-pasted SQL. The ML engineering team wants a model endpoint that can fetch online features with predictable latency. The platform team wants canary deployment, rollback, metrics, and clear ownership. The fraud operations team wants a success criterion that is operational, not academic: candidate models may only be promoted when point-in-time joins are leakage-safe, online features are fresh, skew tests pass, and the model endpoint stays within its latency and error budget.

Figure 1: Feature stores and model serving connect reusable feature definitions, offline training datasets, online feature lookup, inference endpoints, deployment controls, and monitoring.

## Learning Objectives

After completing this chapter, you should be able to design a feature-store architecture that separates offline historical retrieval from online low-latency serving while preserving a shared feature definition. You should be able to explain point-in-time joins, source delay, temporal lookback, feature freshness, and training-serving skew. You should also be able to compare model-serving patterns such as batch scoring, online APIs, streaming inference, Kubernetes-native serving, optimized inference runtimes, and managed cloud serving. Finally, you should be able to validate a small feature-store and model-serving contract with explicit online payloads, skew-test evidence, rollout rules, monitoring thresholds, and rollback ownership.

| Capability | What you should be able to do | Lab connection |
|---|---|---|
| Feature governance | Define entities, feature views, owners, data types, TTLs, and freshness SLOs. | Review `feature_store_contract.yml`. |
| Temporal correctness | Apply source delay and lookback windows to avoid future leakage. | Explain the contract fields `source_delay` and `temporal_join_lookback`. |
| Online serving | Describe how low-latency online features are retrieved by entity key. | Inspect `online_feature_payloads.jsonl`. |
| Skew control | Compare offline and online feature values with tolerances. | Validate `skew_test_cases.csv`. |
| Release engineering | Design shadow tests, canary rollout, monitoring, and rollback. | Review `serving_release_plan.yml`. |

## 19.1 Why Feature Stores Exist

A feature is a measurable input used by a model, such as a customer’s transaction count in the last hour, a merchant’s dispute rate in the last thirty days, or a user’s average basket value. A feature store is a governed layer that defines, discovers, computes, stores, retrieves, and monitors those model inputs. Feast describes a feature store as an operational system with an offline store for historical feature extraction and an online store for low-latency production serving.[^1] Databricks similarly describes a feature store as a centralized repository that helps teams find and share features while reducing differences between training-time and inference-time computation.[^2]

The durable idea is that a feature should be treated as a data product with a serving contract. A production feature has an owner, an entity key, an event timestamp, a calculation definition, a data type, a freshness expectation, a defaulting policy, lineage, and consumers. Without that contract, features become invisible application dependencies. The model may look mathematical, but its behavior is controlled by operational data pipelines.

| Concept | Definition | Why it matters in production |
|---|---|---|
| Entity | The business object for which features are retrieved, such as `customer_id`, `merchant_id`, or `product_id`. | The entity key links labels, offline features, online payloads, and predictions. |
| Feature view | A logical group of related features with shared keys, timestamps, and sources. | Grouping keeps definitions reusable and makes ownership explicit. |
| Offline store | Historical feature storage used for training, validation, backtesting, and batch scoring. | Training data must be reproducible and leakage-safe. |
| Online store | Low-latency feature storage used during real-time inference. | Online predictions depend on fresh and correctly keyed values. |
| Registry | Metadata catalog for entities, feature views, services, schemas, owners, and lineage. | Reviewers need a single source of truth for feature contracts. |
| Feature service | A model-specific group of features consumed together. | Feature services make it possible to version the inputs used by a model endpoint. |

Figure 2: A reference feature-store architecture separates historical retrieval, online serving, metadata, materialization, monitoring, and model consumption.

Feature stores do not remove the need for upstream pipelines. Feast explicitly notes that it is not a general-purpose ETL system, data orchestrator, warehouse, or database; it manages and serves feature values using existing infrastructure.[^1] This distinction matters. The data warehouse or lakehouse remains the source of historical truth. Stream and batch pipelines still compute aggregates. The feature store adds a contract layer that makes those computed values consistently available to training and serving systems.

## 19.2 Point-in-Time Correctness and Training-Serving Skew

The most important feature-store problem is not storage. It is time. A model trained to predict fraud at 10:00 may only use information that would have been available by 10:00. If a feature row was generated at 09:59 but did not arrive in the feature table until 10:08, then a production service scoring at 10:00 could not use it. A training dataset that includes it has leaked future information.

A point-in-time join, sometimes called an as-of join or temporal join, attaches each label row to the latest eligible feature values before the prediction timestamp. Microsoft Azure ML documentation frames point-in-time joins as a way to prevent leakage and defines the effective window using `source_delay` and `temporal_join_lookback`: for an observation time `t`, the selected feature must fall inside `[t - temporal_join_lookback, t - source_delay]`.[^3] This small formula is one of the most useful mental models in ML data engineering.
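
The window formula can be sketched in a few lines of plain Python. This is a minimal illustration rather than any feature store's implementation: the function names `eligible_window` and `select_feature_row`, the row layout, and the sample timestamps are all assumptions made for the example.

```python
from datetime import datetime, timedelta


def eligible_window(observation_ts, source_delay, lookback):
    """Return the inclusive (start, end) bounds for usable feature event times."""
    return observation_ts - lookback, observation_ts - source_delay


def select_feature_row(feature_rows, observation_ts, source_delay, lookback):
    """Pick the latest row whose event_ts lies inside the eligible window, or None."""
    start, end = eligible_window(observation_ts, source_delay, lookback)
    candidates = [row for row in feature_rows if start <= row["event_ts"] <= end]
    return max(candidates, key=lambda row: row["event_ts"], default=None)


# Hypothetical feature rows for one customer.
rows = [
    {"event_ts": datetime(2024, 5, 1, 9, 0), "txn_count_1h": 2},
    {"event_ts": datetime(2024, 5, 1, 9, 45), "txn_count_1h": 5},
    {"event_ts": datetime(2024, 5, 1, 9, 59), "txn_count_1h": 9},  # inside the source delay
]
t = datetime(2024, 5, 1, 10, 0)

chosen = select_feature_row(rows, t, timedelta(minutes=10), timedelta(days=7))
print(chosen["txn_count_1h"])  # 5: the 09:59 row is excluded by the 10-minute delay
```

Calling the same function with `source_delay` set to zero selects the 09:59 row instead, which is exactly the kind of value a production service scoring at 10:00 might not yet have.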

Figure 3: Point-in-time joins prevent future values from entering training rows, while skew tests compare the offline values used in training with the online values retrieved during serving.

A simple point-in-time selection can be expressed with a ranked join. The exact syntax differs by warehouse, but the intent is stable: filter to eligible features, rank by feature event time, and keep the latest row per prediction event.

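```sql
-- Point-in-time selection: latest eligible feature row per prediction event.
WITH eligible_features AS (
    SELECT
        l.transaction_id,
        l.customer_id,
        l.scoring_ts,
        f.event_ts AS feature_event_ts,
        f.txn_count_1h,
        f.total_amount_1h,
        ROW_NUMBER() OVER (
            PARTITION BY l.transaction_id
            ORDER BY f.event_ts DESC
        ) AS feature_rank
    FROM labels l
    -- Inner join: labels with no eligible feature row are dropped.
    JOIN customer_velocity_hourly f
      ON l.customer_id = f.customer_id
     -- Upper bound: respect the 10-minute source delay.
     AND f.event_ts <= l.scoring_ts - INTERVAL '10 minutes'
     -- Lower bound: ignore features older than the 7-day lookback.
     AND f.event_ts >= l.scoring_ts - INTERVAL '7 days'
)
SELECT *
FROM eligible_features
WHERE feature_rank = 1;
```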

Training-serving skew appears when the features used by the training job differ from the features used by the production service. The difference may be caused by separate code paths, inconsistent windows, missing online keys, stale materialization, type conversions, different default values, or untracked feature deprecations. The most practical defense is a skew test that samples entity keys, computes or retrieves expected offline values, retrieves online values, and fails if values, types, or timestamps diverge beyond policy.

| Failure mode | Example | Prevention |
|---|---|---|
| Future leakage | Training joins to a feature produced after the prediction time. | Use event timestamps, created timestamps, `source_delay`, and temporal lookback. |
| Stale online value | Redis contains yesterday's merchant risk for a model that requires hourly freshness. | Store feature timestamps and alert on TTL or freshness SLO violations. |
| Missing-key default | The service silently substitutes zero for a missing chargeback count. | Define per-feature default policy and log all fallback usage. |
| Code-path divergence | Notebook SQL and online service code compute different rolling windows. | Use shared feature definitions or generated serving code. |
| Schema drift | Offline feature changes from integer to decimal while online payload remains string. | Validate schema compatibility before promotion. |
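
A skew test of the kind described above can be sketched as a comparison function that returns findings for one entity and feature. The record layout (`value`, `event_ts`), the finding names, and the two-hour age limit are assumptions for illustration; a real test would read offline values from the warehouse and online values from the serving store.

```python
import math
from datetime import datetime, timedelta


def compare_feature(offline, online, now, max_age, rel_tol=1e-6):
    """Return a list of skew findings for one entity/feature pair (empty means pass)."""
    if online is None:
        return ["missing_online_key"]
    findings = []
    if type(offline["value"]) is not type(online["value"]):
        findings.append("type_mismatch")
    elif isinstance(offline["value"], (int, float)):
        # Numeric values are compared with a relative tolerance.
        if not math.isclose(offline["value"], online["value"], rel_tol=rel_tol):
            findings.append("value_mismatch")
    elif offline["value"] != online["value"]:
        findings.append("value_mismatch")
    if now - online["event_ts"] > max_age:
        findings.append("stale_online_value")
    return findings


# Hypothetical sample: one offline value checked against four online situations.
now = datetime(2024, 5, 1, 10, 0)
offline = {"value": 5, "event_ts": datetime(2024, 5, 1, 9, 45)}
cases = {
    "match": {"value": 5, "event_ts": datetime(2024, 5, 1, 9, 45)},
    "string_typed": {"value": "5", "event_ts": datetime(2024, 5, 1, 9, 45)},
    "stale": {"value": 4, "event_ts": datetime(2024, 4, 30, 9, 45)},
    "missing": None,
}
results = {
    name: compare_feature(offline, online, now, max_age=timedelta(hours=2))
    for name, online in cases.items()
}
for name, findings in results.items():
    print(name, findings or "PASS")
```

Returning a list of findings rather than a boolean keeps the evidence that a reviewer needs when a result is routed to a feature owner.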

## 19.3 Feature Store Architecture Patterns

A production feature store is usually implemented as a set of cooperating components rather than a single monolithic database. The offline store may be a warehouse or lakehouse table. The online store may be Redis, Tair, DynamoDB, Cassandra, Bigtable, or another low-latency key-value system. A registry records metadata. Materialization jobs copy or stream selected feature values from offline computation into online serving. Monitoring jobs check freshness, null rates, distribution drift, and skew.

| Design decision | Common default | When to choose differently |
|---|---|---|
| Offline store | Warehouse or lakehouse table partitioned by event date and keyed by entity. | Use a specialized feature platform when lineage, governance, or repeated training-set generation becomes a bottleneck. |
| Online store | Redis-like key-value store for tabular low-latency features. | Use Cassandra, Bigtable, HBase, or DynamoDB-style systems for very high write rates and large keyspaces. |
| Materialization | Scheduled batch for stable features; streaming for fast-changing features. | Use streaming only when freshness materially changes decisions or risk. |
| Feature grouping | One feature service per model version. | Use more granular services when multiple endpoints share only part of a feature set. |
| Defaulting policy | Reject critical missing features and log approved defaults. | Permit defaults only when business owners accept the decision impact. |
| Governance | Owner, description, lineage, TTL, and deprecation status for each feature. | Add approval workflows for regulated decisions, credit, insurance, or sensitive personal data. |
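
The defaulting-policy row can be made concrete with a small lookup sketch. It uses a plain dictionary in place of a real online store, and the names `read_feature`, `MissingCriticalFeature`, the key format `customer:42`, and the approved-defaults mapping are hypothetical choices for this example.

```python
from datetime import datetime, timedelta


class MissingCriticalFeature(Exception):
    """Raised when a feature without an approved default is absent or expired."""


def read_feature(store, key, feature, now, ttl, approved_defaults, fallback_log):
    """Return a fresh online value, an approved default (logged), or raise."""
    record = store.get(key)
    fresh = record is not None and now - record["event_ts"] <= ttl
    if fresh and feature in record["values"]:
        return record["values"][feature]
    if feature in approved_defaults:
        fallback_log.append((key, feature))  # every fallback is recorded for review
        return approved_defaults[feature]
    raise MissingCriticalFeature(f"{feature} unavailable for {key}")


# A dictionary standing in for the online store.
store = {
    "customer:42": {
        "event_ts": datetime(2024, 5, 1, 9, 30),
        "values": {"txn_count_1h": 3},
    }
}
now = datetime(2024, 5, 1, 10, 0)
defaults = {"total_amount_1h": 0.0}  # only this feature has an approved default
log = []

count = read_feature(store, "customer:42", "txn_count_1h", now, timedelta(hours=2), defaults, log)
amount = read_feature(store, "customer:42", "total_amount_1h", now, timedelta(hours=2), defaults, log)
print(count, amount, log)
```

Requesting `chargeback_count_30d` from the same store raises `MissingCriticalFeature`, because no default was approved for it. An expired record is treated the same way as a missing one.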

Feast's feature retrieval model illustrates this architecture well. Historical retrieval uses `get_historical_features(...)` for training data and batch scoring, while online retrieval uses `get_online_features(...)` or a feature server endpoint for real-time prediction.[^4] The important pattern is not the function name. The important pattern is that the model version should reference a stable group of features, and that group should work in both offline and online contexts.

The following YAML sketch shows how teams often express this idea in a platform-independent way. The feature contract states the entity, event timestamp, source delay, lookback, online TTL, and owner. The serving contract states which model version consumes the feature service.

```yaml
feature_view: customer_velocity_1h
entity: customer_id
event_timestamp: event_ts
source_delay: PT10M
temporal_join_lookback: P7D
online_ttl: PT2H
owner: fraud-data-engineering  # illustrative team name
features:
  - txn_count_1h
  - total_amount_1h
  - chargeback_count_30d
feature_service: fraud_risk_service_v3
consumer_model: fraud_risk_xgboost:2.5.0
```

This contract is intentionally boring. Boring contracts are valuable because they make production behavior reviewable. If a feature is too stale, too sparse, too expensive, too poorly documented, or too risky to default, that fact should appear before a model endpoint is promoted.
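
A contract like this can be checked mechanically before review. The sketch below works on an already-parsed dictionary to stay dependency-free. The required-field list, the simplified ISO 8601 duration parser (days, hours, minutes, and seconds only), and the rule that the source delay must be shorter than the lookback are assumptions for illustration, not the lab validator's actual logic.

```python
import re
from datetime import timedelta

_DURATION = re.compile(r"^P(?:(\d+)D)?(?:T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?)?$")
REQUIRED = (
    "feature_view", "entity", "event_timestamp", "source_delay",
    "temporal_join_lookback", "online_ttl", "features",
)


def parse_duration(text):
    """Parse a small subset of ISO 8601 durations such as PT10M, PT2H, or P7D."""
    match = _DURATION.match(text)
    if not match or not any(match.groups()):
        raise ValueError(f"unsupported duration: {text!r}")
    days, hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


def validate_contract(contract):
    """Return a list of problems; an empty list means the contract passed."""
    problems = [f"missing field: {name}" for name in REQUIRED if name not in contract]
    if problems:
        return problems
    delay = parse_duration(contract["source_delay"])
    lookback = parse_duration(contract["temporal_join_lookback"])
    if delay >= lookback:
        # The eligible window [t - lookback, t - delay] would be empty.
        problems.append("source_delay must be shorter than temporal_join_lookback")
    if not contract["features"]:
        problems.append("feature view declares no features")
    return problems


contract = {
    "feature_view": "customer_velocity_1h",
    "entity": "customer_id",
    "event_timestamp": "event_ts",
    "source_delay": "PT10M",
    "temporal_join_lookback": "P7D",
    "online_ttl": "PT2H",
    "features": ["txn_count_1h", "total_amount_1h", "chargeback_count_30d"],
}
print(validate_contract(contract) or "contract OK")
```

Changing `source_delay` to `P8D` in this dictionary produces a single problem, because no feature row could ever fall inside the join window.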

## 19.4 Model Serving Patterns

Model serving is the system boundary where model artifacts, feature values, request schemas, runtime dependencies, traffic policies, and monitoring meet. The right serving design depends on latency, throughput, model size, hardware, freshness, governance, and deployment risk.

| Pattern | Typical use case | Strength | Main risk |
|---|---|---|---|
| Batch scoring | Nightly churn, scheduled credit pre-approval, offline recommendations. | Cheap, reproducible, and easy to audit. | Decisions are not real time. |
| Online API | Fraud scoring, personalization, search ranking, checkout decisions. | Supports interactive products and low-latency decisions. | Requires fresh online features, latency controls, and robust fallback policy. |
| Streaming inference | Event-driven anomaly detection and near-real-time monitoring. | Scores events as they flow through the platform. | Harder state management and operational debugging. |
| Embedded inference | Mobile, edge, or ultra-low-latency application logic. | Avoids network round trips. | Model update, observability, and governance become harder. |
| Managed serving | Cloud-managed endpoint with autoscaling and monitoring. | Reduces platform operations burden. | Can hide details that matter for cost, runtime tuning, and portability. |

Figure 4: Production model serving spans batch jobs, online APIs, streaming consumers, Kubernetes-native endpoints, optimized runtimes, and managed cloud platforms.

Kubernetes-native serving platforms and inference runtimes solve different layers of the problem. KServe provides Kubernetes custom resources and serving patterns for model endpoints. Its Knative mode uses request-based autoscaling, scale-to-zero, revision management, and canary rollout, and the KServe documentation positions this mode as especially useful for predictive inference workloads with variable traffic.[^5] NVIDIA Triton Inference Server, by contrast, is an inference runtime. It supports multiple frameworks, HTTP/REST and gRPC, model repositories, per-model schedulers, readiness and liveness endpoints, and metrics for Kubernetes integration.[^6]

Triton’s dynamic batching is a useful example of a serving-runtime trade-off. Dynamic batching combines individual inference requests on the server to increase throughput for stateless models. NVIDIA recommends starting with a maximum batch size, enabling dynamic batching, measuring latency and throughput, and only then increasing batch size or queue delay while staying inside the latency budget.[^7] This is a production engineering decision, not a checkbox. A fraud endpoint with a 150 ms p95 SLO may accept a small queue delay; a user-facing authorization path may not.
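
The latency-budget reasoning can be written as simple arithmetic. The sketch below subtracts the other request components from the p95 SLO to see how much queue delay remains. The function name and all component timings are hypothetical, and adding p95 figures together is a deliberately conservative simplification, since percentiles do not sum exactly.

```python
def max_queue_delay_ms(p95_slo_ms, feature_lookup_ms, model_exec_ms,
                       network_ms, safety_margin_ms=0.0):
    """Return the queue delay (ms) left after the other components, floored at 0."""
    used = feature_lookup_ms + model_exec_ms + network_ms + safety_margin_ms
    return max(0.0, p95_slo_ms - used)


# Fraud endpoint with a 150 ms p95 SLO and assumed component timings.
fraud_budget = max_queue_delay_ms(150, feature_lookup_ms=15, model_exec_ms=40,
                                  network_ms=20, safety_margin_ms=25)

# A tighter, hypothetical authorization path with a 50 ms SLO.
auth_budget = max_queue_delay_ms(50, feature_lookup_ms=15, model_exec_ms=25,
                                 network_ms=10, safety_margin_ms=5)

print(fraud_budget)  # 50.0 ms available for batching delay
print(auth_budget)   # 0.0 ms: no room for added queue delay
```

In Triton, a budget like this would inform the `max_queue_delay_microseconds` setting in the model's dynamic batching configuration, but measured latency under load should still decide the final value.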

Managed platforms combine several serving concerns into a product workflow. Alibaba Cloud PAI-EAS supports online inference services and AI-powered web applications, heterogeneous CPU/GPU resources, automatic and scheduled scaling, one-click stress testing, phased release, traffic mirroring, and real-time monitoring.[^8] A managed service can accelerate delivery, but the architecture review should still ask the same questions: where do features come from, how are models versioned, how is canary traffic controlled, which metrics trigger rollback, and how are costs governed?

## 19.5 Guided Lab: Build a Feature Store and Model Serving Contract

The guided lab for this chapter asks you to review and validate a small contract for TuranMart’s fraud-risk model. The lab does not require a running Redis cluster, Feast server, Triton server, or Kubernetes cluster. Instead, it focuses on the artifacts that real systems need before deployment: a feature contract, online payload examples, skew-test evidence, and a serving release plan.

Figure 5: The Chapter 19 lab validates the contract between offline features, online feature payloads, skew tests, and model-serving release controls.

| Lab material | Link | Purpose |
|---|---|---|
| Lab README | `README` | Explains the workflow, expected output, cleanup, and troubleshooting steps. |
| Feature-store contract | `feature_store_contract.yml` | Defines entities, feature views, source delay, lookback, TTL, ownership, and quality gates. |
| Online payloads | `online_feature_payloads.jsonl` | Shows sample online-store records that a model endpoint would retrieve by entity key. |
| Skew-test cases | `skew_test_cases.csv` | Compares offline and online feature values with tolerances and review evidence. |
| Serving release plan | `serving_release_plan.yml` | Specifies runtime, latency SLO, batching, shadow test, canary rollout, monitoring, and rollback. |
| Validator | `validate_feature_serving_lab.py` | Performs dependency-free structural checks on the lab artifacts. |
| Extension exercises | `exercises/README.md` | Proposes optional Feast, Triton, KServe, PAI-EAS, and scheduled skew-test extensions. |
| Instructor solution | `solution.md` | Provides grading guidance and reference interpretation. |

Start by reading `feature_store_contract.yml`. Confirm that the `customer_velocity_1h` and `merchant_risk_daily` feature views both declare event timestamps, created timestamps, source delays, lookback windows, online stores, TTLs, freshness SLOs, owners, and default policies. The fields are there because a reviewer should not need to inspect application code to determine whether a feature can be used safely at prediction time.

Next, inspect `online_feature_payloads.jsonl`. Each record includes an entity type, entity key, model feature service, feature view, feature event timestamp, materialization timestamp, and feature values. A real online store may encode this information differently, but the endpoint should still be able to answer two questions: which feature value did the model receive, and how fresh was it?

Then inspect `skew_test_cases.csv`. Notice that the file includes both `PASS` and `REVIEW` results. This is intentional. A mature skew test is not a demonstration that the happy path works once. It is an operational gate that detects differences, routes them to owners, and blocks promotion when feature values diverge beyond tolerance.

Finally, inspect `serving_release_plan.yml`. The plan separates serving-platform concerns from runtime concerns. KServe-style rollout controls manage revisions, traffic, and rollback. Triton-style runtime controls manage inference execution and batching. Monitoring links technical metrics such as p95 latency, HTTP error rate, feature missing rate, freshness, and skew failure rate to promotion or rollback decisions.
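
The link between monitoring metrics and promotion decisions can be sketched as a small gate function. The metric names mirror those mentioned above, but the `Thresholds` class, the limit values, and the rule that a missing metric counts as a breach are assumptions for illustration rather than the contents of the lab's release plan.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Thresholds:
    """Upper limits a canary must stay under to be promoted."""
    p95_latency_ms: float
    http_error_rate: float
    feature_missing_rate: float
    skew_failure_rate: float


def canary_decision(observed, limits):
    """Return ("promote" | "rollback", breached metric names)."""
    breaches = [
        name for name, limit in asdict(limits).items()
        # A metric that was never reported fails closed.
        if observed.get(name, float("inf")) > limit
    ]
    return ("rollback" if breaches else "promote"), breaches


# Hypothetical limits and observations.
limits = Thresholds(p95_latency_ms=150.0, http_error_rate=0.01,
                    feature_missing_rate=0.02, skew_failure_rate=0.0)
healthy = {"p95_latency_ms": 120.0, "http_error_rate": 0.002,
           "feature_missing_rate": 0.01, "skew_failure_rate": 0.0}
degraded = {"p95_latency_ms": 180.0, "http_error_rate": 0.002,
            "feature_missing_rate": 0.01}  # skew metric not reported

print(canary_decision(healthy, limits))
print(canary_decision(degraded, limits))
```

Failing closed on a missing metric is a design choice: a canary that cannot prove it is healthy is treated as unhealthy.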

Run the validator from the lab directory:

```bash
cd shared/labs/ch19_feature_store_model_serving
python3 tests/validate_feature_serving_lab.py
```

Expected output:

```text
PASS validate_feature_store_contract
PASS validate_online_payloads
PASS validate_skew_test_cases
PASS validate_serving_release_plan
All Chapter 19 feature store and model serving checks passed.
```

If the validator fails, read the failure message as a review comment. For example, a missing `source_delay` means the feature contract no longer states how ingestion latency is handled. A missing rollback term means the serving plan does not tell operators what to do when a candidate model harms latency, quality, or business KPIs.

## 19.6 Common Pitfalls and Operational Lessons

Feature stores and model-serving systems fail most often at team boundaries. Data engineers may assume the online service can tolerate late data. ML engineers may assume that online values match the training dataset. Platform engineers may optimize throughput without understanding that a product decision requires fresh features. Operations teams may see a business KPI regress without knowing which model, feature, or deployment step changed.

| Pitfall | Symptom | Root cause | Prevention |
|---|---|---|---|
| Training-serving skew | Offline validation is strong but production quality drops. | Offline and online paths compute or retrieve different values. | Use shared definitions, online/offline comparison jobs, and promotion-blocking skew tests. |
| Data leakage | Validation metrics look unrealistically good. | Future information enters training rows. | Use event timestamps, created timestamps, source delay, and point-in-time joins. |
| Stale online features | API responds successfully but decisions use old data. | Materialization job failed or TTL was not monitored. | Store feature timestamps and alert on freshness SLOs. |
| Feature explosion | Hundreds of undocumented features accumulate. | No ownership, registry, or deprecation policy. | Require owners, descriptions, consumers, and retirement plans. |
| Slow inference | P95 latency exceeds the product SLO. | Feature fan-out, model runtime, network distance, or inefficient batching. | Measure end-to-end latency and tune runtime, cache, and topology together. |
| Unsafe fallback | Missing features silently default to zero. | Default policy was not reviewed with business owners. | Define per-feature default behavior and log all fallback usage. |
| Uncontrolled rollout | A new model harms users immediately. | No shadow test, canary step, or rollback threshold. | Use staged rollout, prediction logging, traffic mirroring, and automatic rollback. |

A reliable design begins with five questions. First, what is the entity and prediction time? Second, which feature values would truly be available at that moment? Third, how will the same definition be used for training and serving? Fourth, what is the latency and freshness budget? Fifth, how will the team detect and repair drift, skew, stale features, and bad deployments?

## 19.7 Exercises

| Level | Exercise |
|---|---|
| Easy | Choose one model from your organization or from a public example. Identify five features that should belong in a feature store. For each feature, write the entity key, timestamp, owner, expected freshness, default policy, and whether it is needed online. |
| Easy | Write a point-in-time join in SQL or Python for a label table and a feature table. Add a `source_delay` parameter and demonstrate how the training dataset changes when the delay increases. |
| Medium | Design an online feature schema for Redis, Tair, DynamoDB, or another key-value store. Include key naming conventions, TTL policy, feature timestamps, and fallback behavior for missing values. |
| Medium | Compare KServe, Triton, a simple FastAPI service, and Alibaba Cloud PAI-EAS for a tabular fraud model and an image embedding model. Explain which runtime and platform you would choose for each case. |
| Challenge | Implement a skew test that compares the latest offline feature values with online served values for 100 sampled entities. The test should fail on missing keys, type mismatch, stale timestamps, and numeric differences beyond tolerance. |
| Team | Run a mock architecture review. Assign one person to data engineering, one to ML engineering, one to platform operations, and one to business ownership. Approve the model only if every stakeholder can explain the feature contract, serving plan, monitoring rules, and rollback path. |

## 19.8 Review Questions

1. Why is a feature store more than a shared table of model inputs?
2. How do `source_delay` and `temporal_join_lookback` reduce the risk of data leakage?
3. What is the difference between an offline store and an online store?
4. Why should a feature service or model-specific feature group be versioned with the model endpoint?
5. Which symptoms suggest training-serving skew rather than a weak model architecture?
6. When is Triton dynamic batching useful, and when might it violate a product latency requirement?
7. How do shadow testing, canary rollout, traffic mirroring, and rollback thresholds reduce deployment risk?
8. What information should be logged with each online prediction to make later debugging and audit possible?

## Chapter Summary

Feature stores and model serving form the production interface between data engineering and machine learning. A feature store turns feature engineering into reusable, governed, point-in-time-correct data products. Its offline store supports training and batch inference. Its online store supports low-latency inference. Its registry and feature services provide the metadata contract that makes reuse possible. The most important feature-store responsibility is consistency: the values used during training must represent the values that would have been available during serving.

Model serving turns trained artifacts into operational systems. Batch scoring is efficient for scheduled decisions. Online APIs serve interactive products. Streaming inference fits event-driven detection. Kubernetes-native serving platforms manage deployment, scaling, traffic, and revision controls. Inference runtimes optimize model execution and batching. Managed cloud services package serving, stress testing, autoscaling, phased release, and monitoring into an operational workflow.

In the next chapter, you will continue the AI data engineering journey by studying vector databases and embeddings. That topic extends the same production theme: models become useful when their data representations are reliable, searchable, governed, and connected to serving systems.

## References

### Footnotes