Quality Assurance Approach
This guide explains how quality assurance works across the roche-data platform — who does what, where rules live, how they’re enforced, and what happens when data fails a check.
Design Principles
Section titled “Design Principles”- Rules are code. Every quality rule lives in git, goes through PR review, and is tested before deployment. No separate “rule management” system.
- Write once, enforce everywhere. Rules are written in CEL (Common Expression Language) and compiled to multiple execution targets — dbt tests for batch enforcement, Snowflake UDFs for interactive validation.
- Four gates, not one. Data passes through four quality gates (G1–G4) as it moves from raw ingestion to certified consumption. Each gate serves a different purpose.
- Shared library, entity bindings. Rules are generic and reusable. A
valid-emailrule works for any email field — entity stewards bind it to their specific properties.
The Four Quality Gates
Section titled “The Four Quality Gates”Data quality is enforced through four gates embedded in the Snowflake medallion architecture. See ADR 0004 for the virtual view design.
Gate 1 — Completeness (Bronze)
Section titled “Gate 1 — Completeness (Bronze)”| Property | Value |
|---|---|
| Layer | Bronze (physical append-only table) |
| Checks | Format valid, types match schema, required fields present |
| Pass | Record lands in Bronze table |
| Fail | Entire file rejected; source gets error log |
| Enforcement | SQL predicates in Bronze DDL |
| Module | rdt-model-store |
G1 is a technical gate. It answers: “Can we even read this data?” Records that fail G1 never enter the system.
Gate 2 — Validity (Silver)
Section titled “Gate 2 — Validity (Silver)”| Property | Value |
|---|---|
| Layer | Silver (virtual view over Bronze) |
| Checks | MRHub identity exists, referential integrity, no orphan records |
| Pass | Included in Silver view |
| Fail | Excluded by view predicate — visible in Bronze but filtered from Silver |
| Enforcement | CEL rules compiled to dbt test predicates + view WHERE clauses |
| Module | rdt-model-store (views), rdt-model-policy (rule compilation) |
G2 validates business identity. A waste record referencing a site that doesn’t exist in MRHub stays in Bronze (data is preserved) but is excluded from Silver (it’s not trusted yet).
Gate 3 — Business Rules (Gold)
Section titled “Gate 3 — Business Rules (Gold)”| Property | Value |
|---|---|
| Layer | Gold (virtual view over Silver) |
| Checks | Range validation, cross-field consistency, freshness SLAs, domain rules |
| Pass | Included in Gold view; KPIs calculated |
| Fail | Excluded by view predicate; Metric Steward alerted |
| Enforcement | CEL rules compiled to dbt tests |
| Module | rdt-model-policy |
G3 applies domain-specific business logic. A weight measurement outside the physically possible range, or a record older than the configured freshness SLA, is excluded from Gold.
Gate 4 — Consistency (Certified)
Section titled “Gate 4 — Consistency (Certified)”| Property | Value |
|---|---|
| Layer | Certified (badge on Gold data) |
| Checks | Trend deviation vs. 30-day average, AI guardrail check |
| Pass | Certified for Cortex Analyst + dashboards |
| Fail | Visible with Warning badge |
| Enforcement | Statistical checks (future phase) |
| Module | TBD |
G4 detects anomalies. Data that passes G1–G3 but shows a sudden 50% spike compared to historical trends gets flagged for review before reaching consumers.
The Rule Library
Section titled “The Rule Library”All DQ rules live in rules/library/, organized by tier and domain. See the DQ Rule Library Reference for the complete specification.
Three Rule Tiers
Section titled “Three Rule Tiers”| Tier | Dependencies | Example | Execution |
|---|---|---|---|
| Tier 1 — Atomic | None (self-contained) | value.email.matches("^.*@roche\\.com$") | dbt SQL + Python UDF + CLI |
| Tier 2 — Lookup | Snowflake reference tables | value.vendor_code in ref.mrhub_vendors.codes | dbt SQL + SQL UDF |
| Tier 3 — Business | Composes multiple checks | Weight within expected range based on recipe | dbt SQL + SQL UDF |
Rule File Format
Section titled “Rule File Format”Each rule is a YAML file with a CEL expression:
id: valid-roche-emailversion: 1.0.0tier: 1domain: formatsemantic_types: [email, roche_email]tags: [roche-100, pii]description: Validates Roche corporate email formatcel: 'value.email.matches("^[a-zA-Z0-9._%+-]+@roche\\.com$")'message: "Must be a valid @roche.com email address"Current Library
Section titled “Current Library”The library currently contains 40 Tier 1 rules across 6 domains:
| Domain | Rules | Examples |
|---|---|---|
| format | 11 | UUID, ISO date, email, phone, IBAN, GLN |
| string | 8 | not-empty, min/max length, no HTML tags, trimmed |
| range | 6 | positive, non-negative, percentage, temperature |
| type | 5 | is-not-null, is-string, is-number, is-boolean, is-integer |
| enum | 6 | country code, currency, language, Roche site status, waste type |
| temporal | 4 | date-not-future, date-not-before-2000, end-after-start |
How Rules Are Bound to Entities
Section titled “How Rules Are Bound to Entities”Entity stewards bind library rules to their entity’s properties in models/{entity}/rules.yaml:
rule_group: WasteTrackingDQentity_ref: "rtis://roche/entities/waste-tracking"version: "2.0.0"
bindings: # Single-field: one property, one or more rules - property: waste_record_id rules: [valid-uuid] - property: waste_type rules: [roche-waste-type] - property: weight_kg rules: [positive-number]
# Cross-field: multiple properties, one rule - properties: [end_date, start_date] rules: [end-after-start]
# Composite: combine multiple rules into a single pass/failcomposites: - id: waste-record-complete operator: AND requires: [valid-uuid, roche-waste-type, positive-number] message: "Waste record failed validation"Compilation and Deployment
Section titled “Compilation and Deployment”When you run rdt-model-policy compile --entity <name>, the system:
- Loads all rules from
rules/library/ - Reads entity bindings from
models/{entity}/rules.yaml - Transpiles each CEL expression to Snowflake SQL
- Remaps field names (generic rule field -> bound entity property)
- Generates a combined dbt test file per entity
- Generates Snowflake UDFs for interactive validation
dbt Tests (Batch Enforcement)
Section titled “dbt Tests (Batch Enforcement)”Output: dbt/tests/{entity}/dq_all_rules.sql
The combined test file uses UNION ALL to check all rules in a single query. Each rule is labeled, so failures show which specific rules failed:
SELECT rule_id, COUNT(*) as failing_rowsFROM ( SELECT 'valid-uuid' as rule_id, CAST(waste_record_id AS VARCHAR) as failing_value FROM {{ ref('waste_tracking_silver') }} WHERE NOT ( REGEXP_LIKE(waste_record_id, '^[0-9a-f]{8}-...') )
UNION ALL
SELECT 'positive-number' as rule_id, CAST(weight_kg AS VARCHAR) as failing_value FROM {{ ref('waste_tracking_silver') }} WHERE NOT ( weight_kg > 0 )) failed_checksGROUP BY rule_iddbt Cloud runs these tests at scale. The test passes if zero rows are returned (no failures). If rules fail, the output shows exactly which rules and how many rows.
For debugging, use --granular-tests to generate one file per rule instead.
Snowflake UDFs (Interactive Validation)
Section titled “Snowflake UDFs (Interactive Validation)”Output: snowflake/udf/dq_{rule_id}.sql
Each rule also generates a Snowflake function for single-value validation (used by the Steward Ratification UI):
-- Example: calling the UDFSELECT DP_MDM.DQ_VALID_ROCHE_EMAIL('user@roche.com');-- Returns: {"rule_id": "valid-roche-email", "passed": true, "message": ""}Tier 1 rules use Python UDFs (wrapping celpy). Tier 2+ rules use SQL UDFs with subqueries against reference tables.
Testing Rules Locally
Section titled “Testing Rules Locally”Before deploying, test rules locally using the CLI:
Single value test
Section titled “Single value test”rdt-model-policy eval-rule \ --rule valid-roche-email \ --value '{"email": "user@roche.com"}'Output:
{ "rule_id": "valid-roche-email", "passed": true, "message": ""}Batch test with fixtures
Section titled “Batch test with fixtures”Create a fixture file with test cases:
{ "cases": [ {"input": {"email": "user@roche.com"}, "expected": true}, {"input": {"email": "user@gmail.com"}, "expected": false}, {"input": {"email": ""}, "expected": false} ]}Run:
rdt-model-policy test-rule \ --rule valid-roche-email \ --fixtures rules/fixtures/valid-roche-email-test.jsonOutput:
{ "rule_id": "valid-roche-email", "total": 3, "passed": 3, "failed": 0, "failures": []}Workflow: Adding a New Rule
Section titled “Workflow: Adding a New Rule”1. Author rule YAML → rules/library/tier1-atomic/{domain}/{rule-id}.yaml2. Write test fixture → rules/fixtures/{rule-id}-test.json3. Test locally → rdt-model-policy test-rule --rule {id} --fixtures {path}4. Update catalog → rdt-model-policy catalog5. Bind to entity → models/{entity}/rules.yaml6. Compile → rdt-model-policy compile --entity {name}7. PR review → git commit + gh pr create8. CI validation → rdt-model-validate all --entity {name}9. Merge → deploy → dbt Cloud runs tests; UDFs deployed to SnowflakeWorkflow: When Data Fails a Check
Section titled “Workflow: When Data Fails a Check”| Gate | What happens | Who acts |
|---|---|---|
| G1 fail | File rejected entirely. Source system receives error log. | Data engineer fixes source format. |
| G2 fail | Record stays in Bronze, excluded from Silver. | Entity steward investigates: missing reference data or bad source record? |
| G3 fail | Record stays in Silver, excluded from Gold. Metric Steward alerted. | Domain expert reviews: is the rule wrong, or is the data wrong? |
| G4 fail | Record visible in Gold with Warning badge. | Analyst reviews trend deviation. May be legitimate change or data issue. |
At every gate, the raw data is preserved — nothing is deleted. Failed records are always visible in the layer below for investigation.
Roles and Responsibilities
Section titled “Roles and Responsibilities”| Role | Responsibility |
|---|---|
| Platform team | Maintains the rule library, transpiler, and compilation pipeline |
| Entity steward | Binds rules to entity properties, reviews DQ results |
| Metric steward | Reviews G3 failures, decides if rule or data needs correction |
| Data engineer | Fixes G1/G2 source data issues |
| Auditor | Reviews rule definitions in git, DQ test results in dbt Cloud |
Architecture Decision
Section titled “Architecture Decision”The CEL-based approach was chosen over OPA (Kubernetes-deployed policy engine) per ADR 0013. Key reasons:
- Portability: Same CEL expression works in dbt (SQL), Snowflake UDF (Python/SQL), and CLI
- No infrastructure: No Kubernetes cluster needed for rule enforcement
- Shared library: Rules are reusable across entities, not redefined per entity
- Git-native: Rules, bindings, and compiled output are all version-controlled
See also: ADR 0004 — Virtual Medallion Architecture for how DQ gates integrate with the Bronze/Silver/Gold view hierarchy.
Related Documentation
Section titled “Related Documentation”- DQ Rule Library Reference — Complete specification of rule format, CEL syntax, and transpilation
- rdt-model-policy Module — CLI subcommands and usage
- Governance, Auditing & Compliance — How DQ fits the broader governance model
- For Data Stewards — Getting started as a steward