There's a specific kind of dread that comes from discovering that your production data has been silently corrupted for the past three weeks. Not a dramatic failure — everything is still running, jobs are completing, no alerts fired. But somewhere in the pipeline, a join is producing duplicates, or a null value is being defaulted to zero, and the downstream analytics have been quietly wrong this entire time.
I've seen this pattern enough times that I've started treating it as an inevitability rather than an edge case. The question isn't whether your data pipeline will develop data quality issues — it's whether you'll catch them before or after they've affected business decisions.
Schema Drift: The Slow Poison
Schema drift happens when the structure of incoming data changes without a corresponding update to the consuming pipeline. A source system adds a new column. A field that was always populated starts arriving as null. A numeric field changes from integer to float. Your pipeline happily continues processing, often silently casting, dropping, or misinterpreting the changed fields.
The reason schema drift is so dangerous is that it doesn't cause errors — it causes wrong results. Your pipeline runs to completion, your tables get updated, and everything looks normal at the infrastructure level. It's only when someone looks at the actual data that the problem surfaces.
The fix is schema validation at ingestion time. Before you process a batch, check that it conforms to the expected schema. Tools like Great Expectations make this easy — you define expectations about column names, types, and value ranges, and the framework validates incoming data against them automatically.
```python
import great_expectations as gx

context = gx.get_context()
validator = context.get_validator(
    datasource_name="orders_pipeline",
    data_asset_name="daily_orders",
)

# Expectations that will catch common schema drift
validator.expect_column_to_exist("order_id")
validator.expect_column_values_to_not_be_null("order_id")
validator.expect_column_values_to_be_of_type("amount", "float")
validator.expect_column_values_to_be_between("amount", min_value=0)

results = validator.validate()
if not results.success:
    # PipelineValidationError is a custom exception defined elsewhere
    raise PipelineValidationError(results)
```
This isn't glamorous, but it's the difference between catching schema drift on day one and discovering it three weeks later.
Late-Arriving Data and the Completeness Problem
In streaming and near-real-time pipelines, late-arriving data is a fact of life. A mobile app event that was generated while the user was offline arrives hours later. A batch file from a partner system lands six hours after its expected window. Your pipeline has already processed the relevant time window and moved on.
The testing challenge here is verifying completeness — not just that data arrived, but that the expected volume arrived within a reasonable window. Row count reconciliation is a simple but effective technique: compare the count of records expected (based on source system exports or prior patterns) against the count received. A significant deviation triggers an investigation.
Row count reconciliation doesn't require sophisticated tooling. A SQL query comparing source counts to destination counts, run as part of your daily pipeline health check, catches the majority of completeness issues before they affect downstream consumers.
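The reconciliation logic itself can be a few lines. A minimal sketch (the function name and the 1% default tolerance are illustrative choices, not from any particular tool):

```python
def reconcile_counts(source_count: int, destination_count: int,
                     tolerance: float = 0.01) -> bool:
    """Return True if the destination row count is within `tolerance`
    (as a fraction) of the source row count."""
    if source_count == 0:
        return destination_count == 0
    deviation = abs(source_count - destination_count) / source_count
    return deviation <= tolerance
```

Feed it the counts from your source export and your destination table as part of the daily health check, and treat a `False` as a reason to investigate, not necessarily to hard-fail, since some deviation may be legitimate late arrival.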
For streaming pipelines, you want watermarking — a mechanism that tracks how far behind the current processing is from real time, and triggers reprocessing or alerting when late-arriving events are detected beyond a threshold.
Duplicate Records: Silent Revenue Killers
Duplicate records are the most financially dangerous data quality issue in pipelines that feed financial or transactional systems. A duplicate transaction record can inflate reported revenue. A duplicate user record can cause the same person to receive two marketing emails. A duplicate order record in a fulfillment system can trigger a double shipment.
Duplicates enter pipelines in several ways: exactly-once delivery is hard to guarantee in distributed systems, retry logic in ingestion creates duplicates when an operation partially succeeds, and at-least-once semantics (the default in most event streaming systems) explicitly allow duplicates.
Testing for duplicates means checking for uniqueness on your natural keys, not just on generated surrogate keys. Just because your database assigned unique auto-increment IDs doesn't mean the underlying business events are unique. Check COUNT(*) vs COUNT(DISTINCT business_key) regularly.
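The COUNT(*) vs COUNT(DISTINCT) check is one query. A runnable sketch using an in-memory SQLite table (the table and column names are illustrative):

```python
import sqlite3

def surplus_duplicates(conn, table: str, business_key: str) -> int:
    # COUNT(*) - COUNT(DISTINCT key): rows beyond one per business key.
    # Zero means every business key is unique.
    query = f"SELECT COUNT(*) - COUNT(DISTINCT {business_key}) FROM {table}"
    return conn.execute(query).fetchone()[0]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, order_ref TEXT)")
# id is unique for every row, but order A-2 was ingested twice
conn.executemany("INSERT INTO orders (order_ref) VALUES (?)",
                 [("A-1",), ("A-2",), ("A-2",)])
```

Note that the surrogate `id` column passes a uniqueness check here while the business key `order_ref` does not, which is exactly the failure mode described above.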
Referential Integrity Testing
One of the most common data quality failures I encounter in pipelines is broken referential integrity — records that reference IDs that don't exist in their parent tables. An order record with a customer_id that doesn't exist in the customers table. A transaction with a product_id that was deleted from the catalog.
Relational databases enforce this through foreign keys, but many data warehouses (Snowflake, BigQuery, Redshift) either don't support foreign key constraints or accept them only as informational metadata that isn't enforced on write. That means your pipeline can happily write orphaned records without any error.
Testing referential integrity in your pipeline validation layer means explicitly checking that foreign key values resolve. A query like this should return a count of zero:
```sql
-- Find orders with no matching customer
SELECT COUNT(*) AS orphaned_orders
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.id
WHERE c.id IS NULL;
```
Run this as part of your post-load validation. If it returns non-zero, fail the pipeline run and alert.
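Wired into a post-load step, that check becomes a few lines of Python. A self-contained sketch using SQLite (the schema and function name are illustrative):

```python
import sqlite3

ORPHAN_CHECK = """
SELECT COUNT(*) FROM orders o
LEFT JOIN customers c ON o.customer_id = c.id
WHERE c.id IS NULL
"""

def assert_no_orphans(conn) -> None:
    """Fail the pipeline run if any order references a missing customer."""
    orphans = conn.execute(ORPHAN_CHECK).fetchone()[0]
    if orphans:
        raise RuntimeError(f"{orphans} orphaned order(s): failing pipeline run")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER)")
conn.execute("INSERT INTO customers (id) VALUES (1)")
conn.execute("INSERT INTO orders (customer_id) VALUES (1)")
```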
Fanout in Streaming Pipelines
Fanout is a less-discussed but genuinely tricky problem in streaming architectures. It occurs when a single input event triggers multiple downstream records — often intentionally, but sometimes not. A webhook that should produce one event per user action suddenly produces five, because someone changed the retry configuration upstream without realizing the downstream consumer wasn't idempotent.
I worked with a team where a fanout bug in their Kafka consumer caused their daily active user count to be overstated by roughly 4x for almost two weeks. The events were all real — they just weren't supposed to each be counted separately. The fix was a deduplication step in the consumer, but the testing lesson was to validate that fanout ratios stay within expected bounds.
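Validating fanout bounds is another small arithmetic check. A sketch (the function name, the 1:1 expected ratio, and the 10% tolerance are illustrative defaults):

```python
def fanout_within_bounds(input_events: int, output_records: int,
                         expected_ratio: float = 1.0,
                         tolerance: float = 0.1) -> bool:
    """Check that the output/input ratio stays within `tolerance`
    (as a fraction of the expected ratio) of the expected fanout."""
    if input_events == 0:
        return output_records == 0
    ratio = output_records / input_events
    return abs(ratio - expected_ratio) <= tolerance * expected_ratio
```

A 4x overcount like the one described above trips this check on the first run rather than two weeks in.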
Data Contract Testing for Pipelines
Contract testing, borrowed from API testing, is increasingly being applied to data pipelines. The idea is that producers and consumers of data agree on a contract — schema, semantics, and SLAs — and you test that both sides are honoring it.
For a data pipeline, this means the team producing data (say, the application engineering team) and the team consuming it (the data engineering or analytics team) formally define what the data should look like and what guarantees apply. When the producer changes something, they run the contract tests to verify they haven't broken consumers.
This is especially valuable in organizations where the application team and data team have different release cycles. Without contract testing, the data team discovers breaking changes when their pipeline fails — usually at 3am.
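In its simplest form, a data contract can be a shared definition of field names and types that both teams test against. A minimal hand-rolled sketch (the `ORDERS_CONTRACT` fields are hypothetical; tools like Pydantic or JSON Schema offer richer versions of this idea):

```python
# Hypothetical contract for an "orders" dataset: field names and types
ORDERS_CONTRACT = {
    "order_id": str,
    "amount": float,
    "created_at": str,
}

def contract_violations(record: dict, contract: dict) -> list[str]:
    """Return a list of ways `record` violates the contract (empty = OK)."""
    problems = []
    for field, expected_type in contract.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"wrong type for {field}: {type(record[field]).__name__}")
    return problems
```

The producer runs this against sample output in CI before every release; the consumer runs it at ingestion. Both sides testing against the same definition is what makes it a contract rather than documentation.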
Testing at Every Stage, Not Just Output
The most important structural principle in data pipeline testing is to test at each transformation stage, not just at the final output. A pipeline that loads raw data, transforms it, enriches it with reference data, and aggregates it has at least four places where quality should be verified.
Testing only at the output means that when something is wrong, you have no visibility into which stage introduced the problem. Testing at each stage gives you immediate signal and makes root cause analysis orders of magnitude faster.
- Raw ingestion stage: Schema validation, row count vs source, no-null checks on required fields
- Transformation stage: Business rule validation, value range checks, uniqueness on keys
- Enrichment stage: Referential integrity, join success rate (what % of records failed to enrich)
- Aggregation/output stage: Total reconciliation against expected aggregates, comparison with prior periods for anomaly detection
None of these checks are individually sophisticated. The sophistication is in making them systematic and automatic — baked into the pipeline itself, not run manually when something looks off.
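One way to make the checks systematic is a thin harness that runs every stage's checks and reports which stages failed, rather than stopping at the first error. A sketch (the stage names mirror the list above; the harness itself is illustrative):

```python
def run_stage_checks(stage_checks) -> list[str]:
    """Run each (stage_name, check_fn) pair; a check fails if it
    returns falsy or raises. Return the names of failing stages."""
    failures = []
    for stage, check in stage_checks:
        try:
            passed = check()
        except Exception:
            passed = False
        if not passed:
            failures.append(stage)
    return failures
```

Knowing *which* stage failed is the whole point: it is what turns "the dashboard looks wrong" into "the enrichment join broke on Tuesday."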
Data quality failures are rarely dramatic. They're quiet degradations that accumulate until someone asks a question the data can't honestly answer. Testing at each pipeline stage is the only way to catch them before that moment.
The tools are mature — Great Expectations, dbt tests, Apache Griffin, or even a well-organized set of SQL validation queries. What matters is the discipline to run them consistently and the willingness to fail a pipeline run rather than let bad data flow downstream.