Quality Assurance Approach

This guide explains how quality assurance works across the roche-data platform — who does what, where rules live, how they’re enforced, and what happens when data fails a check.

  1. Rules are code. Every quality rule lives in git, goes through PR review, and is tested before deployment. No separate “rule management” system.
  2. Write once, enforce everywhere. Rules are written in CEL (Common Expression Language) and compiled to multiple execution targets — dbt tests for batch enforcement, Snowflake UDFs for interactive validation.
  3. Four gates, not one. Data passes through four quality gates (G1–G4) as it moves from raw ingestion to certified consumption. Each gate serves a different purpose.
  4. Shared library, entity bindings. Rules are generic and reusable. A valid-email rule works for any email field — entity stewards bind it to their specific properties.

Data quality is enforced through four gates embedded in the Snowflake medallion architecture. See ADR 0004 for the virtual view design.

Data quality gates — G1 through G4 across the medallion architecture

| Property | Value |
| --- | --- |
| Layer | Bronze (physical append-only table) |
| Checks | Format valid, types match schema, required fields present |
| Pass | Record lands in Bronze table |
| Fail | Entire file rejected; source gets error log |
| Enforcement | SQL predicates in Bronze DDL |
| Module | rdt-model-store |

G1 is a technical gate. It answers: “Can we even read this data?” Records that fail G1 never enter the system.
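As a sketch of what G1 enforcement can look like — the table and column names here are illustrative, not the actual rdt-model-store DDL — typed columns and `NOT NULL` constraints mean a record that fails to parse or lacks required fields never lands in the table:

```sql
-- Illustrative Bronze DDL (hypothetical names). The load fails before any
-- row is written if types don't parse or required fields are missing.
CREATE TABLE IF NOT EXISTS BRONZE.WASTE_TRACKING_RAW (
    waste_record_id VARCHAR       NOT NULL,  -- required field
    weight_kg       NUMBER(10,3)  NOT NULL,  -- must parse as a number
    recorded_at     TIMESTAMP_NTZ NOT NULL,  -- must parse as a timestamp
    raw_payload     VARIANT                  -- original record kept for audit
);
```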

| Property | Value |
| --- | --- |
| Layer | Silver (virtual view over Bronze) |
| Checks | MRHub identity exists, referential integrity, no orphan records |
| Pass | Included in Silver view |
| Fail | Excluded by view predicate — visible in Bronze but filtered from Silver |
| Enforcement | CEL rules compiled to dbt test predicates + view WHERE clauses |
| Module | rdt-model-store (views), rdt-model-policy (rule compilation) |

G2 validates business identity. A waste record referencing a site that doesn’t exist in MRHub stays in Bronze (data is preserved) but is excluded from Silver (it’s not trusted yet).
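A minimal sketch of the G2 view predicate, assuming hypothetical Bronze and MRHub reference table names (the real views are generated by rdt-model-store):

```sql
-- Sketch: Silver is a view over Bronze. Records whose site has no MRHub
-- identity are filtered out here, but remain queryable in Bronze.
CREATE OR REPLACE VIEW SILVER.WASTE_TRACKING AS
SELECT b.*
FROM BRONZE.WASTE_TRACKING_RAW b
WHERE EXISTS (
    SELECT 1
    FROM REF.MRHUB_SITES s
    WHERE s.site_id = b.site_id
);
```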

| Property | Value |
| --- | --- |
| Layer | Gold (virtual view over Silver) |
| Checks | Range validation, cross-field consistency, freshness SLAs, domain rules |
| Pass | Included in Gold view; KPIs calculated |
| Fail | Excluded by view predicate; Metric Steward alerted |
| Enforcement | CEL rules compiled to dbt tests |
| Module | rdt-model-policy |

G3 applies domain-specific business logic. A weight measurement outside the physically possible range, or a record older than the configured freshness SLA, is excluded from Gold.
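A sketch of how G3 predicates might stack in the Gold view — the ranges, SLA window, and names below are illustrative, not production thresholds:

```sql
-- Sketch: Gold is a view over Silver; domain rules compose as predicates.
CREATE OR REPLACE VIEW GOLD.WASTE_TRACKING AS
SELECT s.*
FROM SILVER.WASTE_TRACKING s
WHERE s.weight_kg BETWEEN 0 AND 50000                          -- range rule
  AND s.end_date >= s.start_date                               -- cross-field consistency
  AND s.recorded_at >= DATEADD(day, -7, CURRENT_TIMESTAMP());  -- freshness SLA
```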

| Property | Value |
| --- | --- |
| Layer | Certified (badge on Gold data) |
| Checks | Trend deviation vs. 30-day average, AI guardrail check |
| Pass | Certified for Cortex Analyst + dashboards |
| Fail | Visible with Warning badge |
| Enforcement | Statistical checks (future phase) |
| Module | TBD |

G4 detects anomalies. Data that passes G1–G3 but shows a sudden 50% spike compared to historical trends gets flagged for review before reaching consumers.
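Since G4 is a future phase, the statistical check is not yet specified. One plausible shape, sketched against a hypothetical Gold view, compares each day's total to its trailing 30-day average:

```sql
-- Sketch only: flag days whose total deviates more than 50% from the
-- trailing 30-day average (window excludes the current day).
WITH daily AS (
    SELECT DATE_TRUNC('day', recorded_at) AS obs_date,
           SUM(weight_kg)                 AS total_kg
    FROM GOLD.WASTE_TRACKING
    GROUP BY 1
)
SELECT obs_date, total_kg,
       AVG(total_kg) OVER (
           ORDER BY obs_date ROWS BETWEEN 30 PRECEDING AND 1 PRECEDING
       ) AS avg_30d
FROM daily
QUALIFY ABS(total_kg - avg_30d) / NULLIF(avg_30d, 0) > 0.5;
```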

All DQ rules live in rules/library/, organized by tier and domain. See the DQ Rule Library Reference for the complete specification.

| Tier | Dependencies | Example | Execution |
| --- | --- | --- | --- |
| Tier 1 — Atomic | None (self-contained) | `value.email.matches("^.*@roche\\.com$")` | dbt SQL + Python UDF + CLI |
| Tier 2 — Lookup | Snowflake reference tables | `value.vendor_code in ref.mrhub_vendors.codes` | dbt SQL + SQL UDF |
| Tier 3 — Business | Composes multiple checks | Weight within expected range based on recipe | dbt SQL + SQL UDF |
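To illustrate the Tier 3 shape, a composed business check might compile to a dbt-style failing-rows query like this — the recipe join and the ±10% tolerance band are hypothetical, not an actual library rule:

```sql
-- Sketch: weight must fall within ±10% of the recipe's expected weight.
-- Any row returned is a failure.
SELECT w.waste_record_id
FROM {{ ref('waste_tracking_silver') }} w
JOIN {{ ref('recipes') }} r
  ON r.recipe_id = w.recipe_id
WHERE NOT (w.weight_kg BETWEEN r.expected_kg * 0.9 AND r.expected_kg * 1.1);
```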

Each rule is a YAML file with a CEL expression:

`rules/library/tier1-atomic/format/valid-roche-email.yaml`

```yaml
id: valid-roche-email
version: 1.0.0
tier: 1
domain: format
semantic_types: [email, roche_email]
tags: [roche-100, pii]
description: Validates Roche corporate email format
cel: 'value.email.matches("^[a-zA-Z0-9._%+-]+@roche\\.com$")'
message: "Must be a valid @roche.com email address"
```

The library currently contains 40 Tier 1 rules across 6 domains:

| Domain | Rules | Examples |
| --- | --- | --- |
| format | 11 | UUID, ISO date, email, phone, IBAN, GLN |
| string | 8 | not-empty, min/max length, no HTML tags, trimmed |
| range | 6 | positive, non-negative, percentage, temperature |
| type | 5 | is-not-null, is-string, is-number, is-boolean, is-integer |
| enum | 6 | country code, currency, language, Roche site status, waste type |
| temporal | 4 | date-not-future, date-not-before-2000, end-after-start |

Entity stewards bind library rules to their entity’s properties in models/{entity}/rules.yaml:

```yaml
rule_group: WasteTrackingDQ
entity_ref: "rtis://roche/entities/waste-tracking"
version: "2.0.0"
bindings:
  # Single-field: one property, one or more rules
  - property: waste_record_id
    rules: [valid-uuid]
  - property: waste_type
    rules: [roche-waste-type]
  - property: weight_kg
    rules: [positive-number]
  # Cross-field: multiple properties, one rule
  - properties: [end_date, start_date]
    rules: [end-after-start]
# Composite: combine multiple rules into a single pass/fail
composites:
  - id: waste-record-complete
    operator: AND
    requires: [valid-uuid, roche-waste-type, positive-number]
    message: "Waste record failed validation"
```

When you run `rdt-model-policy compile --entity <name>`, the system:

  1. Loads all rules from rules/library/
  2. Reads entity bindings from models/{entity}/rules.yaml
  3. Transpiles each CEL expression to Snowflake SQL
  4. Remaps field names (generic rule field -> bound entity property)
  5. Generates a combined dbt test file per entity
  6. Generates Snowflake UDFs for interactive validation
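Steps 3–4 can be pictured like this (a sketch — the real transpiler output may differ): the generic CEL field `value.email` is rewritten to the bound entity property, here assumed for illustration to be `contact_email`:

```sql
-- CEL source:  value.email.matches("^[a-zA-Z0-9._%+-]+@roche\\.com$")
-- SQL predicate after remapping the generic field `email` -> `contact_email`:
REGEXP_LIKE(contact_email, '^[a-zA-Z0-9._%+-]+@roche\.com$')
-- dbt tests then select rows WHERE NOT (<predicate>) to surface failures.
```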

Output: dbt/tests/{entity}/dq_all_rules.sql

The combined test file uses UNION ALL to check all rules in a single query. Each rule is labeled, so failures show which specific rules failed:

```sql
SELECT
    rule_id,
    COUNT(*) AS failing_rows
FROM (
    SELECT 'valid-uuid' AS rule_id, CAST(waste_record_id AS VARCHAR) AS failing_value
    FROM {{ ref('waste_tracking_silver') }}
    WHERE NOT (
        REGEXP_LIKE(waste_record_id, '^[0-9a-f]{8}-...')
    )

    UNION ALL

    SELECT 'positive-number' AS rule_id, CAST(weight_kg AS VARCHAR) AS failing_value
    FROM {{ ref('waste_tracking_silver') }}
    WHERE NOT (
        weight_kg > 0
    )
) failed_checks
GROUP BY rule_id
```

dbt Cloud runs these tests at scale. The test passes if zero rows are returned (no failures). If rules fail, the output shows exactly which rules and how many rows.

For debugging, use --granular-tests to generate one file per rule instead.

Output: snowflake/udf/dq_{rule_id}.sql

Each rule also generates a Snowflake function for single-value validation (used by the Steward Ratification UI):

```sql
-- Example: calling the UDF
SELECT DP_MDM.DQ_VALID_ROCHE_EMAIL('user@roche.com');
-- Returns: {"rule_id": "valid-roche-email", "passed": true, "message": ""}
```

Tier 1 rules use Python UDFs (wrapping celpy). Tier 2+ rules use SQL UDFs with subqueries against reference tables.
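As an illustration of the Tier 2 shape — the rule id, reference table, and DDL below are hypothetical, not the deployed functions — a SQL UDF can embed an uncorrelated subquery against the reference data and return the same JSON object shape as above:

```sql
-- Sketch: a Tier 2 lookup rule as a Snowflake SQL UDF (illustrative names).
CREATE OR REPLACE FUNCTION DP_MDM.DQ_MRHUB_VENDOR_CODE(vendor_code VARCHAR)
RETURNS VARIANT
AS
$$
    OBJECT_CONSTRUCT(
        'rule_id', 'mrhub-vendor-code',
        'passed',  vendor_code IN (SELECT code FROM REF.MRHUB_VENDORS),
        'message', ''
    )
$$;
```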

Before deploying, test rules locally using the CLI:

```sh
rdt-model-policy eval-rule \
  --rule valid-roche-email \
  --value '{"email": "user@roche.com"}'
```

Output:

```json
{
  "rule_id": "valid-roche-email",
  "passed": true,
  "message": ""
}
```

Create a fixture file with test cases:

```json
{
  "cases": [
    {"input": {"email": "user@roche.com"}, "expected": true},
    {"input": {"email": "user@gmail.com"}, "expected": false},
    {"input": {"email": ""}, "expected": false}
  ]
}
```

Run:

```sh
rdt-model-policy test-rule \
  --rule valid-roche-email \
  --fixtures rules/fixtures/valid-roche-email-test.json
```

Output:

```json
{
  "rule_id": "valid-roche-email",
  "total": 3,
  "passed": 3,
  "failed": 0,
  "failures": []
}
```

The end-to-end workflow for adding a new rule:

1. Author rule YAML → rules/library/tier1-atomic/{domain}/{rule-id}.yaml
2. Write test fixture → rules/fixtures/{rule-id}-test.json
3. Test locally → rdt-model-policy test-rule --rule {id} --fixtures {path}
4. Update catalog → rdt-model-policy catalog
5. Bind to entity → models/{entity}/rules.yaml
6. Compile → rdt-model-policy compile --entity {name}
7. PR review → git commit + gh pr create
8. CI validation → rdt-model-validate all --entity {name}
9. Merge → deploy → dbt Cloud runs tests; UDFs deployed to Snowflake

What happens when data fails a check depends on the gate:

| Gate | What happens | Who acts |
| --- | --- | --- |
| G1 fail | File rejected entirely. Source system receives error log. | Data engineer fixes source format. |
| G2 fail | Record stays in Bronze, excluded from Silver. | Entity steward investigates: missing reference data or bad source record? |
| G3 fail | Record stays in Silver, excluded from Gold. Metric Steward alerted. | Domain expert reviews: is the rule wrong, or is the data wrong? |
| G4 fail | Record visible in Gold with Warning badge. | Analyst reviews trend deviation. May be legitimate change or data issue. |

At every gate, the raw data is preserved — nothing is deleted. Failed records are always visible in the layer below for investigation.

Responsibilities are split across roles:

| Role | Responsibility |
| --- | --- |
| Platform team | Maintains the rule library, transpiler, and compilation pipeline |
| Entity steward | Binds rules to entity properties, reviews DQ results |
| Metric steward | Reviews G3 failures, decides if rule or data needs correction |
| Data engineer | Fixes G1/G2 source data issues |
| Auditor | Reviews rule definitions in git, DQ test results in dbt Cloud |

The CEL-based approach was chosen over OPA (Kubernetes-deployed policy engine) per ADR 0013. Key reasons:

  • Portability: Same CEL expression works in dbt (SQL), Snowflake UDF (Python/SQL), and CLI
  • No infrastructure: No Kubernetes cluster needed for rule enforcement
  • Shared library: Rules are reusable across entities, not redefined per entity
  • Git-native: Rules, bindings, and compiled output are all version-controlled

See also: ADR 0004 — Virtual Medallion Architecture for how DQ gates integrate with the Bronze/Silver/Gold view hierarchy.