Quality Assurance Approach

This guide explains how quality assurance works across the roche-data platform — who does what, where rules live, how they’re enforced, and what happens when data fails a check.

  1. Rules are code. Every quality rule lives in git, goes through PR review, and is tested before deployment. No separate “rule management” system.
  2. Write once, enforce everywhere. Rules are written in CEL (Common Expression Language) and compiled to multiple execution targets — dbt tests for batch enforcement, Snowflake UDFs for interactive validation.
  3. Four gates, not one. Data passes through four quality gates (G1–G4) as it moves from raw ingestion to certified consumption. Each gate serves a different purpose.
  4. Shared library, entity bindings. Rules are generic and reusable. A valid-email rule works for any email field — entity stewards bind it to their specific properties.

Data quality is enforced through four gates embedded in the Snowflake medallion architecture. See ADR 0004 for the virtual view design.

Data quality gates — G1 through G4 across the medallion architecture

| Property | Value |
| --- | --- |
| Layer | Bronze (physical append-only table) |
| Checks | Format valid, types match schema, required fields present |
| Pass | Record lands in Bronze table |
| Fail | Entire file rejected; source gets error log |
| Enforcement | SQL predicates in Bronze DDL |
| Module | rdt-model-store |

G1 is a technical gate. It answers: “Can we even read this data?” Records that fail G1 never enter the system.
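As a sketch of what G1 enforcement can look like — the table and column names here are illustrative, not the actual rdt-model-store DDL — typed columns and `NOT NULL` constraints mean a record that fails to parse or lacks required fields never lands in the table:

```sql
-- Illustrative Bronze DDL (hypothetical names). The load fails before any
-- row is written if types don't parse or required fields are missing.
CREATE TABLE IF NOT EXISTS BRONZE.WASTE_TRACKING_RAW (
    waste_record_id VARCHAR       NOT NULL,  -- required field
    weight_kg       NUMBER(10,3)  NOT NULL,  -- must parse as a number
    recorded_at     TIMESTAMP_NTZ NOT NULL,  -- must parse as a timestamp
    raw_payload     VARIANT                  -- original record kept for audit
);
```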

| Property | Value |
| --- | --- |
| Layer | Silver (virtual view over Bronze) |
| Checks | MRHub identity exists, referential integrity, no orphan records |
| Pass | Included in Silver view |
| Fail | Excluded by view predicate — visible in Bronze but filtered from Silver |
| Enforcement | CEL rules compiled to dbt test predicates + view WHERE clauses |
| Module | rdt-model-store (views), rdt-model-policy (rule compilation) |

G2 validates business identity. A waste record referencing a site that doesn’t exist in MRHub stays in Bronze (data is preserved) but is excluded from Silver (it’s not trusted yet).
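A minimal sketch of the G2 view predicate, assuming hypothetical Bronze and MRHub reference table names (the real views are generated by rdt-model-store):

```sql
-- Sketch: Silver is a view over Bronze. Records whose site has no MRHub
-- identity are filtered out here, but remain queryable in Bronze.
CREATE OR REPLACE VIEW SILVER.WASTE_TRACKING AS
SELECT b.*
FROM BRONZE.WASTE_TRACKING_RAW b
WHERE EXISTS (
    SELECT 1
    FROM REF.MRHUB_SITES s
    WHERE s.site_id = b.site_id
);
```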

| Property | Value |
| --- | --- |
| Layer | Gold (virtual view over Silver) |
| Checks | Range validation, cross-field consistency, freshness SLAs, domain rules |
| Pass | Included in Gold view; KPIs calculated |
| Fail | Excluded by view predicate; Metric Steward alerted |
| Enforcement | CEL rules compiled to dbt tests |
| Module | rdt-model-policy |

G3 applies domain-specific business logic. A weight measurement outside the physically possible range, or a record older than the configured freshness SLA, is excluded from Gold.
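A sketch of how G3 predicates might stack in the Gold view — the ranges, SLA window, and names below are illustrative, not production thresholds:

```sql
-- Sketch: Gold is a view over Silver; domain rules compose as predicates.
CREATE OR REPLACE VIEW GOLD.WASTE_TRACKING AS
SELECT s.*
FROM SILVER.WASTE_TRACKING s
WHERE s.weight_kg BETWEEN 0 AND 50000                          -- range rule
  AND s.end_date >= s.start_date                               -- cross-field consistency
  AND s.recorded_at >= DATEADD(day, -7, CURRENT_TIMESTAMP());  -- freshness SLA
```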

| Property | Value |
| --- | --- |
| Layer | Certified (badge on Gold data) |
| Checks | Trend deviation vs. 30-day average, AI guardrail check |
| Pass | Certified for Cortex Analyst + dashboards |
| Fail | Visible with Warning badge |
| Enforcement | Statistical checks (future phase) |
| Module | TBD |

G4 detects anomalies. Data that passes G1–G3 but shows a sudden 50% spike compared to historical trends gets flagged for review before reaching consumers.
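Since G4 is a future phase, the statistical check is not yet specified. One plausible shape, sketched against a hypothetical Gold view, compares each day's total to its trailing 30-day average:

```sql
-- Sketch only: flag days whose total deviates more than 50% from the
-- trailing 30-day average (window excludes the current day).
WITH daily AS (
    SELECT DATE_TRUNC('day', recorded_at) AS obs_date,
           SUM(weight_kg)                 AS total_kg
    FROM GOLD.WASTE_TRACKING
    GROUP BY 1
)
SELECT obs_date, total_kg,
       AVG(total_kg) OVER (
           ORDER BY obs_date ROWS BETWEEN 30 PRECEDING AND 1 PRECEDING
       ) AS avg_30d
FROM daily
QUALIFY ABS(total_kg - avg_30d) / NULLIF(avg_30d, 0) > 0.5;
```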

All DQ rules live in rules/library/, organized by tier and domain. See the DQ Rule Library Reference for the complete specification.

| Tier | Dependencies | Example | Execution |
| --- | --- | --- | --- |
| Tier 1 — Atomic | None (self-contained) | `value.email.matches("^.*@roche\\.com$")` | dbt SQL + Python UDF + CLI |
| Tier 2 — Lookup | Snowflake reference tables | `value.vendor_code in ref.mrhub_vendors.codes` | dbt SQL + SQL UDF |
| Tier 3 — Business | Composes multiple checks | Weight within expected range based on recipe | dbt SQL + SQL UDF |
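To illustrate the Tier 3 shape, a composed business check might compile to a dbt-style failing-rows query like this — the recipe join and the ±10% tolerance band are hypothetical, not an actual library rule:

```sql
-- Sketch: weight must fall within ±10% of the recipe's expected weight.
-- Any row returned is a failure.
SELECT w.waste_record_id
FROM {{ ref('waste_tracking_silver') }} w
JOIN {{ ref('recipes') }} r
  ON r.recipe_id = w.recipe_id
WHERE NOT (w.weight_kg BETWEEN r.expected_kg * 0.9 AND r.expected_kg * 1.1);
```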

Each rule is a YAML file with a CEL expression:

`rules/library/tier1-atomic/format/valid-roche-email.yaml`

```yaml
id: valid-roche-email
version: 1.0.0
tier: 1
domain: format
semantic_types: [email, roche_email]
tags: [roche-100, pii]
description: Validates Roche corporate email format
cel: 'value.email.matches("^[a-zA-Z0-9._%+-]+@roche\\.com$")'
message: "Must be a valid @roche.com email address"
```

The library currently contains 40 Tier 1 rules across 6 domains:

| Domain | Rules | Examples |
| --- | --- | --- |
| format | 11 | UUID, ISO date, email, phone, IBAN, GLN |
| string | 8 | not-empty, min/max length, no HTML tags, trimmed |
| range | 6 | positive, non-negative, percentage, temperature |
| type | 5 | is-not-null, is-string, is-number, is-boolean, is-integer |
| enum | 6 | country code, currency, language, Roche site status, waste type |
| temporal | 4 | date-not-future, date-not-before-2000, end-after-start |

Entity stewards bind library rules to their entity’s properties in models/{entity}/rules.yaml:

```yaml
rule_group: WasteTrackingDQ
entity_ref: "rtis://roche/entities/waste-tracking"
version: "2.0.0"
bindings:
  # Single-field: one property, one or more rules
  - property: waste_record_id
    rules: [valid-uuid]
  - property: waste_type
    rules: [roche-waste-type]
  - property: weight_kg
    rules: [positive-number]
  # Cross-field: multiple properties, one rule
  - properties: [end_date, start_date]
    rules: [end-after-start]
# Composite: combine multiple rules into a single pass/fail
composites:
  - id: waste-record-complete
    operator: AND
    requires: [valid-uuid, roche-waste-type, positive-number]
    message: "Waste record failed validation"
```

When you run `rdt-model-policy compile --entity <name>`, the system:

  1. Loads all rules from rules/library/
  2. Reads entity bindings from models/{entity}/rules.yaml
  3. Transpiles each CEL expression to Snowflake SQL
  4. Remaps field names (generic rule field -> bound entity property)
  5. Generates a combined dbt test file per entity
  6. Generates Snowflake UDFs for interactive validation
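Steps 3–4 can be pictured like this (a sketch — the real transpiler output may differ): the generic CEL field `value.email` is rewritten to the bound entity property, here assumed for illustration to be `contact_email`:

```sql
-- CEL source:  value.email.matches("^[a-zA-Z0-9._%+-]+@roche\\.com$")
-- SQL predicate after remapping the generic field `email` -> `contact_email`:
REGEXP_LIKE(contact_email, '^[a-zA-Z0-9._%+-]+@roche\.com$')
-- dbt tests then select rows WHERE NOT (<predicate>) to surface failures.
```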

Output: dbt/tests/{entity}/dq_all_rules.sql

The combined test file uses UNION ALL to check all rules in a single query. Each rule is labeled, so failures show which specific rules failed:

```sql
SELECT
    rule_id,
    COUNT(*) AS failing_rows
FROM (
    SELECT 'valid-uuid' AS rule_id, CAST(waste_record_id AS VARCHAR) AS failing_value
    FROM {{ ref('waste_tracking_silver') }}
    WHERE NOT (
        REGEXP_LIKE(waste_record_id, '^[0-9a-f]{8}-...')
    )

    UNION ALL

    SELECT 'positive-number' AS rule_id, CAST(weight_kg AS VARCHAR) AS failing_value
    FROM {{ ref('waste_tracking_silver') }}
    WHERE NOT (
        weight_kg > 0
    )
) failed_checks
GROUP BY rule_id
```

dbt Cloud runs these tests at scale. The test passes if zero rows are returned (no failures). If rules fail, the output shows exactly which rules and how many rows.

For debugging, use --granular-tests to generate one file per rule instead.

Output: snowflake/udf/dq_{rule_id}.sql

Each rule also generates a Snowflake function for single-value validation (used by the Steward Ratification UI):

```sql
-- Example: calling the UDF
SELECT DP_MDM.DQ_VALID_ROCHE_EMAIL('user@roche.com');
-- Returns: {"rule_id": "valid-roche-email", "passed": true, "message": ""}
```

Tier 1 rules use Python UDFs (wrapping celpy). Tier 2+ rules use SQL UDFs with subqueries against reference tables.
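As an illustration of the Tier 2 shape — the rule id, reference table, and DDL below are hypothetical, not the deployed functions — a SQL UDF can embed an uncorrelated subquery against the reference data and return the same JSON object shape as above:

```sql
-- Sketch: a Tier 2 lookup rule as a Snowflake SQL UDF (illustrative names).
CREATE OR REPLACE FUNCTION DP_MDM.DQ_MRHUB_VENDOR_CODE(vendor_code VARCHAR)
RETURNS VARIANT
AS
$$
    OBJECT_CONSTRUCT(
        'rule_id', 'mrhub-vendor-code',
        'passed',  vendor_code IN (SELECT code FROM REF.MRHUB_VENDORS),
        'message', ''
    )
$$;
```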

Before deploying, test rules locally using the CLI:

```sh
rdt-model-policy eval-rule \
  --rule valid-roche-email \
  --value '{"email": "user@roche.com"}'
```

Output:

```json
{
  "rule_id": "valid-roche-email",
  "passed": true,
  "message": ""
}
```

Create a fixture file with test cases:

```json
{
  "cases": [
    {"input": {"email": "user@roche.com"}, "expected": true},
    {"input": {"email": "user@gmail.com"}, "expected": false},
    {"input": {"email": ""}, "expected": false}
  ]
}
```

Run:

```sh
rdt-model-policy test-rule \
  --rule valid-roche-email \
  --fixtures rules/fixtures/valid-roche-email-test.json
```

Output:

```json
{
  "rule_id": "valid-roche-email",
  "total": 3,
  "passed": 3,
  "failed": 0,
  "failures": []
}
```

The end-to-end workflow for adding a new rule:

1. Author rule YAML → rules/library/tier1-atomic/{domain}/{rule-id}.yaml
2. Write test fixture → rules/fixtures/{rule-id}-test.json
3. Test locally → rdt-model-policy test-rule --rule {id} --fixtures {path}
4. Update catalog → rdt-model-policy catalog
5. Bind to entity → models/{entity}/rules.yaml
6. Compile → rdt-model-policy compile --entity {name}
7. PR review → git commit + gh pr create
8. CI validation → rdt-model-validate all --entity {name}
9. Merge → deploy → dbt Cloud runs tests; UDFs deployed to Snowflake

What happens when data fails a check depends on the gate:

| Gate | What happens | Who acts |
| --- | --- | --- |
| G1 fail | File rejected entirely. Source system receives error log. | Data engineer fixes source format. |
| G2 fail | Record stays in Bronze, excluded from Silver. | Entity steward investigates: missing reference data or bad source record? |
| G3 fail | Record stays in Silver, excluded from Gold. Metric Steward alerted. | Domain expert reviews: is the rule wrong, or is the data wrong? |
| G4 fail | Record visible in Gold with Warning badge. | Analyst reviews trend deviation. May be legitimate change or data issue. |

At every gate, the raw data is preserved — nothing is deleted. Failed records are always visible in the layer below for investigation.

Responsibilities are split across roles:

| Role | Responsibility |
| --- | --- |
| Platform team | Maintains the rule library, transpiler, and compilation pipeline |
| Entity steward | Binds rules to entity properties, reviews DQ results |
| Metric steward | Reviews G3 failures, decides if rule or data needs correction |
| Data engineer | Fixes G1/G2 source data issues |
| Auditor | Reviews rule definitions in git, DQ test results in dbt Cloud |

The CEL-based approach was chosen over OPA (Kubernetes-deployed policy engine) per ADR 0013. Key reasons:

  • Portability: Same CEL expression works in dbt (SQL), Snowflake UDF (Python/SQL), and CLI
  • No infrastructure: No Kubernetes cluster needed for rule enforcement
  • Shared library: Rules are reusable across entities, not redefined per entity
  • Git-native: Rules, bindings, and compiled output are all version-controlled

See also: ADR 0004 — Virtual Medallion Architecture for how DQ gates integrate with the Bronze/Silver/Gold view hierarchy.