Solution Architecture
| Field | Value |
|---|---|
| Document Title | RDT MODEL — Solution Architecture |
| Version | 1.0 |
| Date | 2026-05-03 |
| Status | Draft |
| Classification | Roche Internal |
Authorship
| Role | Name |
|---|---|
| Author | Sebastian Streit |
| Reviewer | Xavier Gutierrez |
| Approver | Nick Perry |
| Approver | Paulina Maria Swiecicka |
Change History
| Version | Date | Author | Changes |
|---|---|---|---|
| 0.1 | 2026-05-03 | Sebastian Streit | Initial draft — all sections |
1. Purpose
This document describes the solution architecture for RDT MODEL, a data infrastructure compiler built as a Rust CLI platform for Roche Global IT. The system takes an RTiS ontology model as input and produces a complete, certified data product as output: all Snowflake layers (Bronze/Silver/Gold/Semantic), data contract, OPA policies, SDK, MCP tool, API specification, documentation, and audit trail.
A single command — rdt-model-compile run --entity <name> — orchestrates 18 specialised CLI modules across 6 pipeline phases to deliver every artifact required for a data product to be discoverable, governed, quality-assured, and AI-ready.
1.1 Scope
This architecture covers:
- The 18-module CLI pipeline and its orchestration model
- All 21 external system integrations (sources, targets, bidirectional)
- The Snowflake medallion architecture (Bronze/Silver/Gold/Semantic)
- Data quality enforcement (OPA real-time + dbt batch, gates G1–G4)
- The shared Rust library (rdt-model-common) and cross-cutting patterns
- Two Streamlit in Snowflake UI applications (CRUD, Ratification)
- Physical infrastructure: Snowflake, CaaS Kubernetes, Vault, GitHub Actions
- Environment strategy: dev/test/prod through configuration, not separate systems
1.2 Assumptions
| # | Assumption |
|---|---|
| A1 | RTiS is the canonical source of truth for all data model definitions. Every data product originates from an RTiS entity. |
| A2 | Snowflake is the sole target analytics platform. All medallion layers deploy to a single Snowflake account with schema-level isolation per environment. |
| A3 | Stub-first development: modules implement the full interface using stub clients until access tasks (A01–A19) are resolved. The pipeline is runnable in --dry-run mode without credentials. |
| A4 | All environments (dev/test/prod) share the same physical infrastructure. Separation is achieved through configuration (schema prefixes, K8s namespaces, Vault paths). |
| A5 | GitHub Actions is the CI/CD platform. All deployment flows run through GitHub Actions workflows. |
| A6 | Collibra is the enterprise data governance platform. Stewardship metadata is pulled at generation time; lineage is pushed at deployment time. |
| A7 | PingFederate (via Snowflake WAM) is the OAuth identity provider for all Snowflake access. |
1.3 Constraints
| # | Constraint | Impact |
|---|---|---|
| C1 | Access tasks (A01–A19) block live integrations with external systems. Until resolved, all modules use StubClient implementations returning fixture data. | Modules are developed and tested against stubs; live integration is a configuration change, not a code change. |
| C2 | Single Snowflake account for all environments. No separate accounts for dev/test/prod. | Environment isolation relies on schema naming conventions (DEV_BRONZE, TEST_BRONZE, PROD_BRONZE). |
| C3 | Roche VPN required for RTiS, GUPRI, MRHub, and Vault access. GitHub Actions runners must be on-network. | CI/CD runners must be self-hosted or use Roche’s VPN-connected runner pool. |
| C4 | Collibra deployment model (on-prem vs. cloud) not yet confirmed. Network path may require async batch sync. | Architecture supports both real-time REST and batch file exchange patterns. |
| C5 | Rust expertise required for CLI development. | Mitigated by LLM-assisted development and comprehensive ADR documentation. |
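Constraint C2 and assumption A4 rest on a naming convention rather than separate accounts. A minimal Rust sketch of that convention follows. The names `Environment`, `Layer`, and `schema_name` are hypothetical, and only the Bronze schema names (DEV_BRONZE, TEST_BRONZE, PROD_BRONZE) are confirmed by the table above; applying the same prefix rule to the other layers is an assumption.

```rust
/// Deployment environments that share one Snowflake account (constraint C2).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Environment {
    Dev,
    Test,
    Prod,
}

/// Medallion layers. Only the Bronze schema names appear in the document;
/// reusing the prefix rule for the other layers is an assumption.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Layer {
    Bronze,
    Silver,
    Gold,
    Semantic,
}

impl Environment {
    fn prefix(self) -> &'static str {
        match self {
            Environment::Dev => "DEV",
            Environment::Test => "TEST",
            Environment::Prod => "PROD",
        }
    }
}

impl Layer {
    fn suffix(self) -> &'static str {
        match self {
            Layer::Bronze => "BRONZE",
            Layer::Silver => "SILVER",
            Layer::Gold => "GOLD",
            Layer::Semantic => "SEMANTIC",
        }
    }
}

/// Hypothetical helper: builds the schema name `<ENV>_<LAYER>`.
fn schema_name(env: Environment, layer: Layer) -> String {
    format!("{}_{}", env.prefix(), layer.suffix())
}
```

Under these assumptions, `schema_name(Environment::Dev, Layer::Bronze)` yields `DEV_BRONZE`, so promoting an artifact between environments changes only the `Environment` value passed in.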
1.4 Related Documents
| Document | Location |
|---|---|
| ADR 0001 — Project Vision | adr/0001-project-vision.md |
| ADR 0007 — Data Product Lifecycle | adr/0007-data-product-lifecycle.md |
| ADR 0009 — Module I/O Contracts | adr/0009-module-io-contracts.md |
| ADR 0010 — Environment Strategy | adr/0010-environment-strategy.md |
| ADR 0011 — Pipeline Restructure | adr/0011-pipeline-restructure.md |
| Pipeline Overview | docs/src/content/docs/architecture/pipeline-overview.md |
2. Definitions
| Term | Definition |
|---|---|
| RTiS | Roche Terminology and Information Services — the canonical source of data model definitions, ontologies, terminologies, and synonyms. Deployed on AWS behind Roche VPN. |
| GUPRI | Globally Unique Persistent Roche Identifier — a persistent identifier system that assigns resolvable URIs to every artifact. Ensures global uniqueness across Roche systems. |
| Collibra | Enterprise data governance platform providing stewardship, ownership, classification, SLAs, PII flags, and lineage tracking. Bidirectional: provides metadata at generation, receives lineage on deployment. |
| Data Product | The complete output of one pipeline run for one entity: all Snowflake layers, data contract, policies, SDK, MCP tool, API spec, documentation, and audit trail. There is no partial product. |
| Medallion Architecture | A layered data architecture pattern: Bronze (raw, append-only), Silver (curated, validity-checked), Gold (business-ready, rule-checked), Semantic (AI-queryable, metric definitions). See ADR 0004. |
| OPA | Open Policy Agent — a general-purpose policy engine. Used here for real-time data quality enforcement and access control, deployed as containers on CaaS Kubernetes. |
| Rego | The declarative policy language used by OPA. Generated from YAML rule definitions by rdt-model-policy. |
| dbt | Data Build Tool — a SQL-first transformation framework. Used here for batch data quality enforcement and view materialisation in Snowflake. |
| Data Contract | A machine-readable specification (datacontract.com 1.1.0) defining the schema, SLA, quality expectations, and ownership of a data product. Generated by rdt-model-contract. |
| MCP | Model Context Protocol — an open standard for AI tool definitions. Generated MCP tools expose Gold data products to AI agents (Cortex Analyst, Claude). |
| CaaS | Container as a Service — Roche’s managed Kubernetes platform (Rancher-based). Hosts OPA policy containers and bundle refresh jobs. |
| Cortex Analyst | Snowflake’s AI-powered natural language query engine. Consumes Semantic view definitions to answer business questions in natural language. |
| PingFederate | Roche’s enterprise identity provider. All OAuth flows (including Snowflake WAM) route through PingFederate for authentication. |
| WAM | Web Access Management — Snowflake’s OAuth integration layer that delegates authentication to PingFederate via client_credentials grant. |
| DQ Gate | Data Quality Gate — one of four mandatory quality checkpoints (G1–G4) that data passes through before certification. Each gate has specific checks and failure consequences. |
| Stub Client | A test implementation of a system integration trait that returns fixture data from local JSON files. Enables full pipeline execution without live credentials. |
| Entity | A logical data object defined in RTiS (e.g., “waste-tracking”, “site-energy”). The unit of work for the pipeline — one entity produces one complete data product. |
| MRHub | Master Reference Hub — Roche’s master data system providing reference data for G2 validity checks and Solace change events. |
| Solace | Enterprise event bus for publishing data product lifecycle events (creation, update, deprecation). |
| Sinequa | Enterprise search engine. Receives offline documentation for data product discovery across Roche. |
| Mulesoft | API management platform (Anypoint). Publishes generated OpenAPI specifications as managed, governed APIs. |
| Snowflake Horizon | Snowflake’s cross-account data governance and discovery layer. Used for registering data products for cross-account access. |
3. Current State (AS-IS)
No formal data product architecture exists today. The current state across Roche data domains is characterised by:
Manual, linear process. Creating a new data product takes 3–6 months of specialist involvement. Each business question requires a dedicated data engineer, manual Snowflake provisioning, hand-written dbt models, and ad-hoc quality checks. There is no reusable infrastructure.
Disconnected systems. RTiS holds ontology definitions, Collibra holds governance metadata, Snowflake hosts the data — but no automated pipeline connects them. Metadata flows are manual, inconsistent, and frequently stale.
No shared semantic layer. KPI definitions diverge across teams. The same metric exists in 4+ variants. Global reporting requires manual Excel reconciliation between domain teams.
No data contracts. Upstream schema changes cascade unpredictably into downstream consumers. Breakages surface in production dashboards and board presentations. There is no machine-readable contract between producer and consumer.
Inconsistent quality assurance. Some data products have strict dbt tests; others have none. Users cannot distinguish certified data from unchecked data. This creates a false sense of quality that is more dangerous than having no quality gates at all.
No AI readiness. Zero data products have semantic definitions suitable for natural language queries. Cortex Analyst cannot be deployed. AI agents have no MCP tools to access governed data.
The immediate trigger is the Global Sites Network: 100+ operational tools that cannot communicate, producing siloed data with no shared semantics. The same structural problems exist across all Roche data domains. The platform is designed as a horizontal solution serving all domains, with Global Sites Network as the pilot.
4. Proposed Architecture (TO-BE)
4.1 Solution Overview
RDT MODEL is a data infrastructure compiler: it takes a declarative model definition as input and produces a complete, deployable data product as output.
RTiS is the grammar. roche-data is the compiler. Git is the object store. dbt is the runtime.
The compiler is implemented as a Rust CLI workspace containing 18 specialised binary modules plus one shared library. A single orchestration command — rdt-model-compile run --entity <name> — invokes all modules in dependency order across 6 sequential phases. Within each phase, modules that share no data dependency execute in parallel.
From one RTiS entity definition, the compiler produces:
| # | Artifact | Purpose |
|---|---|---|
| 1 | Bronze table + G1 DQ | Physical append-only landing, schema enforcement |
| 2 | Silver view + G2 DQ | Curated data, validity-checked against master data |
| 3 | Gold view + G3 DQ | Business-ready data, rule-checked and SLA-governed |
| 4 | Semantic view | AI-queryable metrics for Cortex Analyst |
| 5 | Data contract | Machine-readable schema + SLA + quality spec (datacontract.com 1.1.0) |
| 6 | OPA policies | Real-time DQ enforcement + access control (6 policy domains) |
| 7 | dbt tests | Batch DQ enforcement, aligned with OPA rules |
| 8 | Python + CLI SDK | Type-safe programmatic access for consumers |
| 9 | MCP tool | AI agent tool definition (Cortex Analyst, Claude) |
| 10 | OpenAPI spec | REST API contract for managed publication |
| 11 | Documentation | Offline docs for enterprise search (Sinequa) |
| 12 | Platform events | Solace creation/update events + ServiceNow change records |
| 13 | Audit trail | CSRD/GDPR/GxP-compliant creation audit |
All artifacts are committed to git, deployed through GitHub Actions CI/CD, and registered with GUPRI persistent identifiers. There is no partial product — every entity gets the full stack.
4.1.1 Evolutionary Architecture
The platform is designed to scale across four dimensions:
Multi-domain scaling. Global Sites Network is the pilot domain. The same CLI, templates, and pipeline serve every Roche data domain. Adding a new domain requires only RTiS model definitions and domain-specific business rules — no platform changes.
LLM enrichment. Claude on AWS Bedrock enriches metadata where RTiS coverage is insufficient: terminology mappings, synonym generation, field descriptions, and DQ rule suggestions. All LLM output is human-reviewable and subject to the four-eyes PR rule.
Self-service. Two Streamlit in Snowflake UI applications provide non-technical users with CRUD and ratification capabilities. The Starlight documentation site auto-generates from pipeline artifacts — it cannot drift from implementation.
Iterative quality. Data products ship structurally complete on day one. Quality rules improve iteratively: Bronze rules are mechanical (from schema), Silver rules add master data checks, Gold rules incorporate business logic from domain experts. The pipeline re-runs on model changes without re-architecture.
4.1.2 Alternatives Rejected
| Alternative | Reason for rejection |
|---|---|
| Commercial data catalog (DataHub, Atlan, Unity Catalog) | These catalog what already exists. They do not generate the artifacts that make data AI-ready. RDT MODEL is upstream — it generates the metadata catalogs consume. Not mutually exclusive: Collibra remains as the governance layer. |
| Python CLI toolbox | Python CLIs carry virtualenv and dependency management debt into every CI pipeline. A Rust binary ships as a single file with no runtime dependencies, starts an order of magnitude faster in CI, and eliminates “it works on my machine” failures. |
| Central team builds all products | Preserves linear scalability: new business question = new ticket = 6–8 weeks. The platform shifts this to exponential: new question = one CLI command = one CI cycle. |
| Separate repos per domain | Artifact types are deeply coupled (contract YAML that Bronze depends on comes from the same model as Semantic YAML). Monorepo keeps versions consistent and makes template changes propagate to all domains simultaneously. |
4.2 Business Architecture
4.2.1 Data Product Lifecycle
A data product follows this lifecycle (see ADR 0007):
graph LR
DEFINE["**DEFINE**<br/>RTiS + Collibra"]
GENERATE["**GENERATE**<br/>CLI runs 18 mods"]
VALIDATE["**VALIDATE**<br/>Schemas + DQ"]
DEPLOY["**DEPLOY**<br/>CI/CD promote"]
CERTIFY["**CERTIFY**<br/>G4 pass → stable"]
REFINE["**REFINE**<br/>PR-based rule updates"]

DEFINE --> GENERATE --> VALIDATE --> DEPLOY --> CERTIFY
CERTIFY --> REFINE
REFINE --> DEFINE

- Define: Data stewards define the entity in RTiS (schema, ontology, relationships) and configure governance in Collibra (ownership, SLA, classification).
- Generate: rdt-model-compile run orchestrates all 18 modules to produce the complete artifact set.
- Validate: rdt-model-validate checks all artifacts against JSON Schemas, syntax rules, and cross-references.
- Deploy: GitHub Actions CI/CD promotes validated artifacts through dev → test → prod.
- Certify: The G4 Consistency gate confirms that data meets trend baselines and AI guardrails.
- Refine: Domain experts improve quality rules and business logic through the normal PR workflow. Re-running the pipeline generates updated artifacts.
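The six lifecycle steps form a cycle, which can be modelled as a small state machine. The following Rust sketch is illustrative only: the `Stage` enum and the `next` and `walk` helpers are hypothetical names, not part of the CLI.

```rust
/// Lifecycle stages of a data product, in the order given above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Stage {
    Define,
    Generate,
    Validate,
    Deploy,
    Certify,
    Refine,
}

impl Stage {
    /// Returns the stage that follows `self`; Refine loops back to Define.
    fn next(self) -> Stage {
        match self {
            Stage::Define => Stage::Generate,
            Stage::Generate => Stage::Validate,
            Stage::Validate => Stage::Deploy,
            Stage::Deploy => Stage::Certify,
            Stage::Certify => Stage::Refine,
            Stage::Refine => Stage::Define,
        }
    }
}

/// Walks the cycle for `steps` transitions, returning every stage visited,
/// including the starting one.
fn walk(start: Stage, steps: usize) -> Vec<Stage> {
    let mut visited = Vec::with_capacity(steps + 1);
    let mut current = start;
    visited.push(current);
    for _ in 0..steps {
        current = current.next();
        visited.push(current);
    }
    visited
}
```

Six transitions from `Stage::Define` return to `Stage::Define`, which mirrors the Refine to Define edge in the diagram.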
4.2.2 Pipeline Phases
The 18 modules are organised into 6 sequential phases. Phases run in order; modules within a phase run in parallel when they share no data dependency.
graph LR
subgraph Phase1["**Phase 1: INGEST**"]
  Pull
  Profile
end
subgraph Phase2["**Phase 2: ENRICH**"]
  Govern
  Infer
end
subgraph Phase3["**Phase 3: PREPARE**"]
  Compile
  Validate
end
subgraph Phase4["**Phase 4: DEPLOY**"]
  Store
  Policy
  Api
  Mcp
  Sdk
  Contract
end
subgraph Phase5["**Phase 5: REGISTER**"]
  Register
  Gupri
  Search
end
subgraph Phase6["**Phase 6: SUPPORT**"]
  Docs
  Cidb
  Event
end

Phase1 --> Phase2 --> Phase3 --> Phase4 --> Phase5 --> Phase6

| Phase | Name | Purpose | Modules | Parallelism |
|---|---|---|---|---|
| 1 | Ingest | Acquire source data from upstream systems | pull, profile | Full (no dependencies between modules) |
| 2 | Enrich | Add governance metadata and LLM intelligence | govern, infer | Full (no dependencies between modules) |
| 3 | Prepare | Compile artifacts and validate correctness | compile, validate | Sequential (validate depends on compile) |
| 4 | Deploy | Push artifacts to target systems | store, policy, api, mcp, sdk, contract | Full (6 modules in parallel) |
| 5 | Register | Announce to enterprise catalogs | register, gupri, search | Full (3 modules in parallel) |
| 6 | Support | Generate docs, compliance records, events | docs, cidb, event | Full (3 modules in parallel) |
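The phase table can be read as a schedule: phases run in order, and modules inside a phase run concurrently unless one depends on another. The Rust sketch below illustrates that schedule under stated assumptions. `Phase`, `pipeline`, `run_module`, and `run_pipeline` are hypothetical names, `run_module` is a placeholder that only reports success, and standard-library scoped threads are used to keep the example dependency-free; this document does not describe the mechanism the real orchestrator uses.

```rust
use std::thread;

/// One pipeline phase: a name plus module names grouped into sequential
/// "waves". Modules inside a wave run in parallel.
struct Phase {
    name: &'static str,
    waves: Vec<Vec<&'static str>>,
}

/// The six phases from the table. Phase 3 has two waves because
/// validate depends on compile.
fn pipeline() -> Vec<Phase> {
    vec![
        Phase { name: "ingest", waves: vec![vec!["pull", "profile"]] },
        Phase { name: "enrich", waves: vec![vec!["govern", "infer"]] },
        Phase { name: "prepare", waves: vec![vec!["compile"], vec!["validate"]] },
        Phase {
            name: "deploy",
            waves: vec![vec!["store", "policy", "api", "mcp", "sdk", "contract"]],
        },
        Phase { name: "register", waves: vec![vec!["register", "gupri", "search"]] },
        Phase { name: "support", waves: vec![vec!["docs", "cidb", "event"]] },
    ]
}

/// Placeholder for invoking a module binary; here it only reports success.
fn run_module(module: &str) -> Result<String, String> {
    Ok(format!("{module}: ok"))
}

/// Runs phases in order and returns one report per module,
/// or the first error encountered.
fn run_pipeline(phases: &[Phase]) -> Result<Vec<String>, String> {
    let mut log = Vec::new();
    for phase in phases {
        for wave in &phase.waves {
            // Scoped threads may borrow the module names from `wave`.
            let results: Vec<Result<String, String>> = thread::scope(|s| {
                let handles: Vec<_> = wave
                    .iter()
                    .map(|m| s.spawn(move || run_module(m)))
                    .collect();
                handles
                    .into_iter()
                    .map(|h| h.join().expect("module thread panicked"))
                    .collect()
            });
            // A failed module stops the pipeline before the next wave starts.
            for result in results {
                log.push(format!("{}/{}", phase.name, result?));
            }
        }
    }
    Ok(log)
}
```

Calling `run_pipeline(&pipeline())` produces 18 reports. Reports are collected in declaration order within each wave, so the log is deterministic even though the modules themselves run concurrently.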
4.2.3 Actor Roles
| Actor | Responsibility | Interaction |
|---|---|---|
| Data Engineer | Defines entity models in RTiS, authors rules.yaml and policies.yaml, runs the pipeline, reviews generated artifacts | CLI + Git + PR workflow |
| Data Steward | Maintains governance metadata in Collibra (ownership, SLA, classification, PII flags). Ratifies changes via Streamlit UI. | Collibra + Ratification UI |
| Platform Team | Maintains the CLI codebase, templates, and infrastructure. Resolves access tasks. Manages CI/CD workflows. | Rust development + GitHub Actions |
| Domain Expert | Refines Gold business rules and Semantic view definitions. Reviews LLM enrichment suggestions. | PR review + rules.yaml authoring |
| Consumer | Queries data products via SDK, API, Cortex Analyst, or direct Snowflake access. Relies on data contracts for stability guarantees. | SDK + API + SQL + natural language |
| AI Agent | Accesses Gold data products via MCP tools. Uses Semantic view definitions for natural language understanding. | MCP protocol + Cortex Analyst |
4.2.4 Business Process Flow
sequenceDiagram
participant DE as Data Engineer
participant PL as Platform (automated)
participant DS as Data Steward

DE->>PL: Define entity in RTiS
DS->>PL: Configure governance in Collibra
DE->>PL: Run pipeline (single cmd)
PL->>PL: Pull model + governance
PL->>PL: Enrich with LLM
PL->>PL: Compile all artifacts
PL->>PL: Validate schemas
PL->>PL: Deploy to Snowflake
PL->>PL: Register in catalogs
PL->>PL: Publish events
PL->>DE: Commit artifacts to git, open PR
DE->>DE: Review PR (4-eyes rule)
DE->>PL: Merge to main
PL->>PL: CI/CD promotes: dev → test → prod
PL->>DS: G4 certification
DS->>DS: Ratify changes via Streamlit UI

4.3 Data Architecture
4.3.1 Conceptual Data Model
The platform operates on a three-stage conceptual model: source ontology → enriched entity model → generated artifacts.
graph TD
RTiS["**RTiS Ontology**<br/>Schema, fields, types,<br/>relationships, terminologies"]
Collibra["**Collibra Governance**<br/>Ownership, SLA, PII,<br/>classification"]
LLM["**Claude on Bedrock**<br/>Term mapping, synonyms,<br/>descriptions"]
Entity["**Entity Model (enriched)**<br/>model.json + governance.json<br/>+ suggestions.json"]

RTiS --> Entity
Collibra --> Entity
LLM --> Entity

Entity --> Snowflake["**Snowflake**<br/>Bronze DDL, Silver SQL,<br/>Gold SQL, Semantic"]
Entity --> Quality["**Quality**<br/>OPA Rego, dbt tests,<br/>K8s deploy, Bundle cfg"]
Entity --> Consumer["**Consumer Access**<br/>SDK, API spec,<br/>MCP tool, Docs"]
Entity --> Platform["**Platform**<br/>Data contract, GUPRI URI,<br/>Solace event, Audit trail"]

Key entities:
| Entity | Description | Cardinality |
|---|---|---|
| RTiS Entity | A logical data object (e.g., “waste-tracking”, “site-energy”). The unit of work. | 1 per data product |
| Field | A typed attribute within an entity. Carries RTiS metadata (type, terminology, synonyms). | N per entity |
| Rule Group | A collection of validation rules authored in rules.yaml. Compiled to both OPA and dbt. | 1 per entity |
| Policy Set | Access, workflow, API, audit, and deployment policies in policies.yaml. | 1 per entity |
| Governance Record | Collibra-sourced stewardship: owner, steward, SLA, classification, PII flags. | 1 per entity |
| Data Product | The complete output bundle: all Snowflake layers + all artifacts. | 1 per entity |
| GUPRI Record | Persistent identifier registration. Every artifact and entity gets a resolvable URI. | N per entity |
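The cardinalities in the table map naturally onto owned fields and collections. The Rust sketch below is one way to express them; every struct and field name is hypothetical and chosen for illustration, and the real model.json schema is not shown in this document.

```rust
/// The four Snowflake layers every data product contains.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Layer {
    Bronze,
    Silver,
    Gold,
    Semantic,
}

/// A typed attribute within an entity (N per entity).
#[derive(Debug, Clone)]
struct Field {
    name: String,
    data_type: String,
}

/// Placeholder 1:1 companions of an entity; their contents are illustrative.
#[derive(Debug, Clone, Default)]
struct RuleGroup {
    rules: Vec<String>,
}

#[derive(Debug, Clone, Default)]
struct PolicySet {
    policies: Vec<String>,
}

#[derive(Debug, Clone, Default)]
struct GovernanceRecord {
    owner: String,
    steward: String,
}

#[derive(Debug, Clone)]
struct Entity {
    name: String,
    fields: Vec<Field>,           // N per entity
    rule_group: RuleGroup,        // 1 per entity
    policy_set: PolicySet,        // 1 per entity
    governance: GovernanceRecord, // 1 per entity
}

#[derive(Debug, Clone)]
struct DataProduct {
    entity_name: String,
    layers: [Layer; 4],      // always all four layers, never a subset
    gupri_uris: Vec<String>, // N per entity
}

impl DataProduct {
    /// Builds the product shell for an entity with all four layers present.
    fn for_entity(entity: &Entity) -> Self {
        DataProduct {
            entity_name: entity.name.clone(),
            layers: [Layer::Bronze, Layer::Silver, Layer::Gold, Layer::Semantic],
            gupri_uris: Vec::new(),
        }
    }
}
```

Using a fixed-size array for `layers` encodes the "no partial product" rule in the type: a `DataProduct` value cannot exist with fewer than four layers.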
Relationships:
graph LR
Entity["**RTiS Entity**"]
Entity -->|1:N| Field
Entity -->|1:1| RuleGroup["Rule Group"]
Entity -->|1:1| PolicySet["Policy Set"]
Entity -->|1:1| GovRecord["Governance Record<br/>(from Collibra)"]
Entity -->|1:1| DataProduct["Data Product<br/>(generated)"]
DataProduct -->|1:N| GUPRI["GUPRI Record<br/>(registered)"]
DataProduct -->|1:1| Contract["Data Contract<br/>(generated)"]
DataProduct -->|1:4| Layers["Snowflake Layers<br/>(Bronze/Silver/Gold/Semantic)"]

4.3.2 Data Governance Architecture
Data governance is architecturally embedded, not bolted on. Two systems enforce quality, and one system provides governance metadata.
Governance metadata source: Collibra
Collibra is the authoritative source for all governance metadata. Ownership, SLAs, data classification, and PII flags are not authored locally — they are pulled from Collibra where data stewards maintain them. The pipeline pulls governance at generation time and pushes lineage at deployment time.
| Metadata | Source | Used by |
|---|---|---|
| Data owner | Collibra | Data contract, documentation |
| Data steward | Collibra | Ratification UI, change notifications |
| SLA (availability, freshness) | Collibra | Data contract, G4 monitoring |
| Data classification | Collibra | OPA access policies, PII handling |
| PII flags | Collibra | Column-level masking policies |
| Terms of use | Collibra | Data contract, SDK documentation |
| Lineage | Generated → Collibra | Enterprise lineage graph |
Quality enforcement: four gates
Data passes through four mandatory quality gates before certification:
graph LR
G1["**G1 COMPLETENESS**<br/>Bronze Layer<br/><br/>Schema match<br/>Type conformance<br/>NOT NULL enforced<br/><br/>_FAIL: file rejected_"]
G2["**G2 VALIDITY**<br/>Silver View<br/><br/>MRHub identity<br/>Orphan detection<br/>Referential integrity<br/><br/>_FAIL: row excluded from Silver_"]
G3["**G3 BUSINESS RULES**<br/>Gold View<br/><br/>Range checks<br/>Cross-field<br/>Freshness SLA<br/><br/>_FAIL: row excluded, steward alerted_"]
G4["**G4 CONSISTENCY**<br/>Certified Product<br/><br/>Trend deviation<br/>AI guardrail<br/>30-day baseline<br/><br/>_FAIL: visible w/ Warning badge_"]

G1 --> G2 --> G3 --> G4

Dual enforcement: OPA (real-time) + dbt (batch)
Rules are defined once in rules.yaml and compiled to two execution targets:
| Target | Engine | Context | Latency | Used by |
|---|---|---|---|---|
| OPA | Rego policies on Kubernetes | API boundary, form validation, real-time checks | Milliseconds | UI, API consumers, integration tests |
| dbt | SQL tests in Snowflake | Pipeline execution, batch validation, regression detection | Seconds–minutes | CI/CD, scheduled runs, monitoring |
Both targets are generated from the same rules.yaml source by rdt-model-policy. They must stay in sync — a rule that passes in OPA must also pass in dbt, and vice versa.
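To illustrate the "define once, compile twice" idea, the Rust sketch below renders one lower-bound rule into a Rego deny rule and a dbt singular test. It is an assumption-laden illustration, not the output of rdt-model-policy: the `MinRule` struct, its fields, and the example names `gold_waste_tracking` and `weight_kg` are hypothetical, and the real rules.yaml schema is not shown in this document. The Rego text uses OPA v1 syntax, and the dbt text follows the singular-test convention that a test fails when its query returns rows.

```rust
/// A single lower-bound rule, standing in for one entry of rules.yaml.
/// The struct and its fields are hypothetical.
struct MinRule {
    model: &'static str,  // dbt model name (assumed)
    column: &'static str, // column / input field name (assumed)
    min: f64,
}

impl MinRule {
    /// Reference semantics that both generated targets should agree with.
    fn passes(&self, value: f64) -> bool {
        value >= self.min
    }

    /// Renders a Rego (v1 syntax) deny rule for real-time checks.
    fn to_rego(&self) -> String {
        format!(
            "deny contains msg if {{\n    input.{col} < {min}\n    msg := \"{col} must be >= {min}\"\n}}\n",
            col = self.column,
            min = self.min
        )
    }

    /// Renders a dbt singular test: the test fails if the query returns rows.
    fn to_dbt_test(&self) -> String {
        format!(
            "select * from {{{{ ref('{model}') }}}} where {col} < {min}\n",
            model = self.model,
            col = self.column,
            min = self.min
        )
    }
}
```

Because both renderers read the same `min` and `column`, the two targets cannot drift apart on the threshold itself, which is the property the sync requirement above depends on.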
4.3.3 System Context
The platform integrates with 21 external systems across source, target, and bidirectional roles.
graph TD
Platform["**RDT MODEL Platform**<br/>18 CLI Modules + rdt-model-common<br/>+ 2 Streamlit UI apps"]

subgraph Sources["SOURCES"]
  RTiS["RTiS (ontology)"]
  Aurora["Aurora PostgreSQL"]
  MRHub["MRHub (master data)"]
  Bedrock["Claude/Bedrock (LLM)"]
  Vault["Vault (secrets)"]
end

subgraph Bidirectional["BIDIRECTIONAL"]
  Collibra["Collibra (governance ↔ lineage)"]
  GUPRI["GUPRI (register ↔ resolve)"]
  Horizon["Snowflake Horizon (discovery ↔ governance)"]
end

subgraph Targets["TARGETS"]
  Snowflake["Snowflake (DDL + views)"]
  K8s["CaaS/Kubernetes (OPA pods)"]
  Mulesoft["Mulesoft (API)"]
  Solace["Solace (events)"]
  Sinequa["Sinequa (search)"]
  ServiceNow["ServiceNow (CIDM)"]
  Artifactory["Artifactory (Docker images)"]
  DataMP["Data Marketplace"]
  MCPReg["MCP Registry"]
end

subgraph PlatformSvc["PLATFORM SERVICES"]
  GHA["GitHub Actions (CI/CD)"]
  Ping["PingFederate/WAM (OAuth)"]
  Starlight["Starlight/Astro (docs)"]
end

Sources --> Platform
Platform <--> Bidirectional
Platform --> Targets
PlatformSvc -.-> Platform

System classification:
| Role | Systems | Data Flow |
|---|---|---|
| Source | RTiS, Aurora PostgreSQL, MRHub, Claude/Bedrock, Vault | System → Platform |
| Bidirectional | Collibra, GUPRI, Snowflake Horizon | System ↔ Platform |
| Target | Snowflake, CaaS/K8s, Mulesoft, Solace, Sinequa, ServiceNow, Artifactory, Data Marketplace, MCP Registry | Platform → System |
| Platform | GitHub Actions, PingFederate/WAM, Starlight/Astro | Infrastructure |
4.3.4 Interface Summary
Each row represents a data exchange between the RDT MODEL platform and an external system. Interface IDs are used for traceability to access tasks.
| IF-ID | Source | Target | Data Exchanged | Frequency | Protocol | Auth | Module | Access Task | Status |
|---|---|---|---|---|---|---|---|---|---|
| IF-01 | RTiS | Platform | Entity definitions, ontologies, terminologies, synonyms | On-demand (pipeline trigger) | REST / GraphQL | OAuth (PingFederate) | pull | A01 | Stub |
| IF-02 | Aurora PostgreSQL | Platform | Upstream table metadata (columns, types, constraints) | On-demand (profiling) | PostgreSQL wire protocol | Username/password (Vault) | profile | A18 | Stub |
| IF-03 | Snowflake WAM | Platform | OAuth access tokens for Snowflake operations | Per-session | REST OIDC (client_credentials) | OAuth (PingFederate) | All Snowflake ops | A19 | Live |
| IF-04 | Platform | Snowflake | Bronze DDL, dbt models (Silver/Gold/Semantic views) | On deployment | Snowflake REST API | OAuth (WAM token) | store | A05/A06 | Partial |
| IF-05 | Collibra | Platform | Governance metadata: ownership, SLA, classification, PII | On-demand (pipeline trigger) | REST API | OAuth | govern | A07 | Stub |
| IF-06 | Platform | Collibra | Lineage records after deployment | On deployment | REST API | OAuth | register | A08 | Stub |
| IF-07 | Claude (Bedrock) | Platform | LLM-generated term mappings, descriptions, DQ suggestions | On-demand (enrichment) | AWS Bedrock API | AWS IAM (SigV4) | infer | — | Planned |
| IF-08 | Platform | CaaS/K8s | OPA deployment manifests, Rego bundles, ConfigMaps | On deployment | Kubernetes API | Rancher token | policy | A13 | Active |
| IF-09 | Platform | Artifactory | OPA container images | On build | Docker Registry v2 | Token | policy (build) | — | Planned |
| IF-10 | Platform | Mulesoft | OpenAPI specifications for managed API publication | On deployment | Anypoint Platform API | OAuth | api | A09 | Stub |
| IF-11 | Platform | MCP Registry | MCP tool definitions for AI agent registration | On deployment | TBD | TBD | mcp | — | Planned |
| IF-12 | GUPRI | Platform | Persistent identifier resolution (existing URIs) | On-demand | REST API | OAuth (PingFederate) | gupri | A02 | Stub |
| IF-13 | Platform | GUPRI | Persistent identifier registration (new URIs) | On deployment | REST API | OAuth (PingFederate) | gupri | A02 | Stub |
| IF-14 | Platform | Snowflake Horizon | Cross-account discovery and governance metadata | On deployment | Snowflake API | OAuth (WAM token) | register | — | Stub |
| IF-15 | Platform | Data Marketplace | Data product catalog registration | On deployment | REST API | TBD | register | A15 | Stub |
| IF-16 | Platform | Sinequa | Offline documentation for enterprise search indexing | On deployment | TBD | TBD | search | A17 | Stub |
| IF-17 | MRHub | Platform | Master reference data for G2 validity lookups | On-demand | REST API | OAuth | policy | A03 | Not started |
| IF-18 | MRHub (Solace) | Platform | Change events for master data updates | Continuous | Solace event subscription | Token | policy | A04 | Not started |
| IF-19 | Platform | Solace | Data product creation/update lifecycle events | On deployment | Solace event publish | Token | event | A04 | Stub |
| IF-20 | Platform | ServiceNow | Change management records (CIDM) | On deployment | REST Table API | OAuth | cidb | A12 | Stub |
| IF-21 | Vault | Platform | Secrets (database credentials, API keys, tokens) | On CI/CD run | REST API (AppRole / OIDC) | AppRole / OIDC JWT | All (via CI) | A16 | Live |
Interface status legend:
| Status | Meaning |
|---|---|
| Live | Integration is operational with real credentials. |
| Partial | Authentication works; functional integration pending (e.g., Snowflake auth live, schema provisioning pending). |
| Active | Infrastructure access confirmed; integration under development. |
| Stub | Module implements the interface using StubClient with fixture data. Switching to live is a configuration change. |
| Planned | Module exists but integration work has not started. |
| Not started | Access task not yet filed or investigated. |
4.3.5 Data Migration
This is a greenfield platform: there is no legacy system to migrate from. However, entity onboarding involves importing existing data structures:
Entity onboarding via rdt-model-profile
For entities that exist in upstream databases but lack RTiS representation, rdt-model-profile provides a discovery path:
graph LR
DB["**Upstream DB**<br/>Aurora PG / Snowflake"]
Profile["**rdt-model-profile**<br/>Discover tables<br/>Extract schema<br/>Suggest model"]
RTiS["**RTiS**<br/>Register as<br/>new entity<br/>(manual step)"]

DB -->|profile| Profile -->|suggest| RTiS

- rdt-model-profile connects to the upstream database (Aurora PostgreSQL or Snowflake).
- Extracts table metadata: column names, types, constraints, sample data statistics.
- Produces a suggested entity model that a data engineer reviews and registers in RTiS.
- Once registered in RTiS, the standard pipeline takes over.
This is a discovery aid, not an automated migration. The data engineer makes all decisions about entity structure, naming, and classification. The profile output is a suggestion that accelerates the manual RTiS registration process.
Data backfill
Historical data from upstream systems is loaded into Bronze tables through the standard ingestion path. There is no special migration tool — Bronze tables are append-only, and historical data is simply the first batch of appended records. The Silver/Gold/Semantic views immediately operate over this data once loaded.
4.4 Logical Architecture
The logical architecture is organised by pipeline phase. Each module follows the conventions in ADR 0008 (CLI module standards) and ADR 0009 (module I/O contracts). See also the Pipeline Overview for data flow and workspace isolation.
4.4.1 rdt-model-pull
Phase: 1 (Ingest)
Purpose: Fetch an entity definition from RTiS and write a frozen JSON snapshot to the pipeline workspace. This is the entry point for every data product — all downstream modules consume the snapshot.
Parallelizable: Yes (parallel with profile within Phase 1)
Input/Output
| Direction | Artifact | Path | Format |
|---|---|---|---|
| Input | RTiS entity ID | CLI argument (--entity) | String |
| Output | Frozen entity snapshot | models/{entity}/model.json | JSON |
| Output | Module result envelope | stdout (when --json) | JSON |
External System Integration
| System | Client Trait | Auth | Interface | Access Task | Status |
|---|---|---|---|---|---|
| RTiS | RTisClient | OAuth (PingFederate) / Basic Auth | IF-01 | A01 | Stub |
Technology
Section titled “Technology”- Language: Rust (edition 2021)
- Async: Yes —
tokio::runtime::Runtime(Pattern A) - Key crates:
reqwest(HTTP),async-trait,chrono,uuid(UUIDv7 run correlation) - Templates: None (data-only module)
Subcommands
Section titled “Subcommands”| Command | Description |
|---|---|
pull | Fetch entity from RTiS and write model.json |
diff | Show changes between local snapshot and RTiS (planned) |
list | List available entities in RTiS |
snapshot | Create versioned snapshot (planned) |
Current Status
Section titled “Current Status”Stub — Full command structure implemented. StubRTisClient returns fixture data from cli/common/src/clients/fixtures/rtis/. HttpRTisClient implemented with JSON-LD response mapping, pending A01 resolution for live RTiS access.
4.4.2 rdt-model-profile
Phase: 1 (Ingest; optional support module)
Purpose: Discover and profile existing database tables in upstream systems (Aurora PostgreSQL, Snowflake) to suggest entity models for RTiS registration. Used for onboarding entities that lack RTiS representation.
Parallelizable: Yes (parallel with pull within Phase 1)
Input/Output
| Direction | Artifact | Path | Format |
|---|---|---|---|
| Input | Database connection + table identifier | CLI arguments (--database-type, --schema, --table) | String |
| Input | Sample row count | CLI argument (--sample-rows, default 100, max 10,000) | Integer |
| Output | Table structural metadata | {output_dir}/{db_type}/{schema}.{table}.profile.json | JSON |
| Output | Sample data | Embedded in profile JSON | JSON |
External System Integration
| System | Client Trait | Auth | Interface | Access Task | Status |
|---|---|---|---|---|---|
| Aurora PostgreSQL | DatabaseProbe | Username/password (Vault) | IF-02 | A18 | Stub |
| Snowflake | DatabaseProbe | OAuth (WAM token) | IF-04 | A19 | Stub |
Technology
- Language: Rust (edition 2021)
- Async: Yes, via `tokio::runtime::Runtime` (Pattern A)
- Key crates: `reqwest`, `async-trait`, `regex` (identifier validation), `flate2` (optional gzip)
- Templates: None (data-only module)
Subcommands
| Command | Description |
|---|---|
| `profile` | Profile a database table: extract schema, types, constraints, sample data |
Current Status
Stub. StubDatabaseProbe returns deterministic fixture data. SQL identifier validation is implemented. Real Aurora PostgreSQL and Snowflake probes are pending access tasks A18 and A19 respectively.
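The identifier validation and the `--sample-rows` bounds described for this module can be illustrated with a minimal sketch. It uses only the standard library rather than the `regex` crate the module lists, and the accepted character set, the function names, and the choice to reject (rather than clamp) oversized values are all assumptions made for illustration.

```rust
/// Default sample size when `--sample-rows` is not given (from the Input/Output table).
const DEFAULT_SAMPLE_ROWS: u32 = 100;
/// Upper bound on sampled rows (from the Input/Output table).
const MAX_SAMPLE_ROWS: u32 = 10_000;

/// Hypothetical rule: accept an identifier only if it starts with an ASCII letter
/// or underscore and continues with ASCII letters, digits, or underscores.
fn is_valid_identifier(name: &str) -> bool {
    let mut chars = name.chars();
    match chars.next() {
        Some(c) if c.is_ascii_alphabetic() || c == '_' => {}
        _ => return false, // empty string or invalid first character
    }
    chars.all(|c| c.is_ascii_alphanumeric() || c == '_')
}

/// Resolve the effective sample size. Rejecting values above the cap is an
/// assumption; the real module might clamp instead.
fn resolve_sample_rows(requested: Option<u32>) -> Result<u32, String> {
    match requested {
        None => Ok(DEFAULT_SAMPLE_ROWS),
        Some(n) if n <= MAX_SAMPLE_ROWS => Ok(n),
        Some(n) => Err(format!("--sample-rows {n} exceeds maximum {MAX_SAMPLE_ROWS}")),
    }
}
```

Validating `--schema` and `--table` before they reach any SQL text keeps the probe from having to quote or escape user input later.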
4.4.3 rdt-model-govern
Phase: 2 (Enrich)
Purpose: Pull governance metadata from Collibra for an entity — ownership, stewardship, data classification, SLAs, PII flags, and terms of use. This metadata feeds into the data contract, documentation, and access policies.
Parallelizable: Yes (parallel with infer within Phase 2)
Input/Output
| Direction | Artifact | Path | Format |
|---|---|---|---|
| Input | Entity ID | CLI argument (--entity) | String |
| Input | model.json (from Phase 1) | models/{entity}/model.json | JSON |
| Output | Governance metadata | models/{entity}/governance.json | JSON |
| Output | Module result envelope | stdout (when --json) | JSON |
External System Integration
| System | Client Trait | Auth | Interface | Access Task | Status |
|---|---|---|---|---|---|
| Collibra | CollibraClient | OAuth (client_id + client_secret + x-meta-bridge-key) | IF-05 | A07 | Stub |
Technology
- Language: Rust (edition 2021)
- Async: Yes, via the `#[tokio::main]` macro
- Key crates: `tokio`, `tracing`
- Templates: None (data passthrough)
Subcommands
| Command | Description |
|---|---|
| `pull` | Fetch governance metadata from Collibra |
| `status` | Show Collibra sync status (planned) |
Current Status
Stub. StubCollibraClient returns fixture CollibraMetadata. HttpCollibraClient is implemented with pagination support. Blocked on A07 (Collibra access task).
4.4.4 rdt-model-infer
Phase: 2 (Enrich, optional)
Purpose: Enrich entity metadata using Claude on AWS Bedrock — generate term mappings, business-friendly synonyms, field descriptions, and DQ rule suggestions where RTiS coverage is insufficient. All suggestions are human-reviewable.
Parallelizable: Yes (parallel with govern within Phase 2)
Input/Output
| Direction | Artifact | Path | Format |
|---|---|---|---|
| Input | Entity ID | CLI argument (--entity) | String |
| Input | model.json (from Phase 1) | models/{entity}/model.json | JSON |
| Input | Scope filter (optional) | CLI argument (--scope: terms, descriptions, rules) | String |
| Output | LLM enrichment suggestions | models/{entity}/suggestions.json | JSON |
| Output | Module result envelope | stdout (when --json) | JSON |
External System Integration
| System | Client Trait | Auth | Interface | Access Task | Status |
|---|---|---|---|---|---|
| Claude (AWS Bedrock) | LlmClient | AWS IAM (SigV4) | IF-07 | — | Planned |
Technology
- Language: Rust (edition 2021)
- Async: Yes, via the `#[tokio::main]` macro
- Key crates: `tokio`, `serde`, `tracing`
- Templates: None (data-only module)
Subcommands
| Command | Description |
|---|---|
| `suggest` | Generate LLM suggestions with optional scope filter |
Current Status
Planned. Async skeleton implemented. StubLlmClient returns hardcoded suggestions. Real Bedrock integration not yet started. LLM provider confirmed as Claude on AWS Bedrock.
4.4.5 rdt-model-compile
Phase: 3 (Prepare)
Purpose: Pipeline orchestrator. Invokes all downstream modules in dependency order, manages workspace lifecycle, aggregates results, and handles artifact promotion from workspace to repository paths.
Parallelizable: No (sequential orchestrator; spawns parallel modules within phases)
Input/Output
| Direction | Artifact | Path | Format |
|---|---|---|---|
| Input | Entity ID | CLI argument (--entity) | String |
| Input | model.json, governance.json, suggestions.json | From Phase 1–2 outputs | JSON |
| Input | rules.yaml, policies.yaml | models/{entity}/ | YAML |
| Output | Orchestration result | compile-result.json (workspace) | JSON |
| Output | All artifacts (20+) | compile/artifacts/ (workspace) | Mixed |
External System Integration
None. The orchestrator delegates all external calls to downstream modules.
Technology
- Language: Rust (edition 2021)
- Async: No, synchronous (Pattern B). Spawns child processes via `std::process::Command`.
- Key crates: `serde`, `serde_json`, `tracing`
- Templates: None (orchestrator has no template responsibility)
Subcommands
| Command | Description |
|---|---|
| `run` | Execute full pipeline (or single stage with --stage) |
| `status` | Show pipeline status (planned) |
| `semantic` | Generate Semantic YAML (delegated, planned) |
| `openapi` | Generate OpenAPI spec (delegated, planned) |
| `mcp` | Generate MCP tool definition (delegated, planned) |
| `rules` | Compile rules.yaml to OPA Rego (delegated, planned) |
| `policies` | Compile policies.yaml to OPA Rego (delegated, planned) |
| `k8s` | Generate OPA K8s manifests (delegated, planned) |
Current Status
Planned. CLI structure defined with 8 subcommands. Orchestration logic not yet implemented. Will spawn modules as child processes with --json to capture result envelopes.
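Since the orchestration logic is not yet implemented, the following is only a sketch of how spawning a module as a child process with `--json` might look using `std::process::Command`. The helper names `module_command` and `run_module`, the argument order, and the error type are assumptions; only the binary naming scheme and the `--entity` / `--json` flags come from this document.

```rust
use std::process::Command;

/// Build (but do not run) the child-process invocation for one module.
/// The `rdt-model-<module>` naming follows this section; the argument
/// layout (subcommand first, then global options) is an assumption.
fn module_command(module: &str, subcommand: &str, entity: &str) -> Command {
    let mut cmd = Command::new(format!("rdt-model-{module}"));
    cmd.arg(subcommand).arg("--entity").arg(entity).arg("--json");
    cmd
}

/// Run the command and return the raw result envelope captured from stdout.
/// A non-zero exit status is surfaced as an error so the orchestrator can stop.
fn run_module(mut cmd: Command) -> Result<String, String> {
    let output = cmd.output().map_err(|e| format!("spawn failed: {e}"))?;
    if !output.status.success() {
        return Err(format!("module exited with {}", output.status));
    }
    String::from_utf8(output.stdout).map_err(|e| format!("stdout not UTF-8: {e}"))
}
```

Capturing stdout separately from stderr matches the two-track reporting convention: the envelope is parsed by the orchestrator while progress logs pass through to the operator.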
4.4.6 rdt-model-validate
Phase: 3 (Prepare)
Purpose: Validate all generated artifacts against JSON Schemas, syntax rules, and cross-references. Acts as a quality gate — the pipeline does not proceed to Phase 4 (Deploy) unless validation passes.
Parallelizable: No (must run after compile completes)
Input/Output
| Direction | Artifact | Path | Format |
|---|---|---|---|
| Input | All artifacts from Phase 3 | compile/artifacts/ (workspace) | Mixed |
| Input | JSON Schemas | Embedded at compile time via include_str! | JSON Schema |
| Output | Validation report | stdout | Text / JSON (with --json) |
| Output | Exit code | 0 (pass) or non-zero (fail) | Process exit |
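The gate behaviour implied by the table (a report on stdout plus a pass/fail exit code) could be sketched as below. The `Finding` type, the report wording, and the use of `1` as the failure code are assumptions for illustration; the actual exit codes are defined by ADR 0008.

```rust
/// One validation finding. Both fields are illustrative, not the module's real type.
struct Finding {
    artifact: String,
    message: String,
}

/// Map findings to a process exit code: 0 when clean, non-zero otherwise.
/// The concrete non-zero value is an assumption.
fn gate_exit_code(findings: &[Finding]) -> i32 {
    if findings.is_empty() { 0 } else { 1 }
}

/// Render the plain-text report, one finding per line.
fn render_report(findings: &[Finding]) -> String {
    if findings.is_empty() {
        return "validation passed".to_string();
    }
    findings
        .iter()
        .map(|f| format!("{}: {}", f.artifact, f.message))
        .collect::<Vec<_>>()
        .join("\n")
}
```

A binary would print `render_report(..)` and then call `std::process::exit(gate_exit_code(..))`, which is what lets the orchestrator block Phase 4 on a failed validation.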
External System Integration
None. Pure local file validation.
Technology
- Language: Rust (edition 2021)
- Async: No, synchronous (Pattern B). Pure file I/O.
- Key crates: `jsonschema` (JSON Schema validation), `serde_json`, `tracing`
- Templates: None (validation only)
Subcommands
| Command | Description |
|---|---|
| `all` | Validate all artifacts for an entity |
| `contract` | Validate data contract YAML |
| `schema` | Validate any JSON/YAML against a schema |
| `dbt` | Validate dbt model files |
| `semantic` | Validate semantic YAML |
| `rules` | Validate rules against a rule group using OPA |
Current Status
Planned. CLI structure defined with 6 subcommands. Validation logic not yet implemented. Will use the jsonschema crate with embedded schemas.
4.4.7 rdt-model-store
Phase: 4 (Deploy)
Purpose: Generate all Snowflake storage artifacts — Bronze DDL, dbt models for Bronze/Silver/Gold layers, and Semantic view definitions. Optionally deploys DDL to Snowflake.
Parallelizable: Yes (parallel with policy, api, mcp, sdk, contract within Phase 4)
Input/Output
| Direction | Artifact | Path | Format |
|---|---|---|---|
| Input | Entity model | models/{entity}/model.json | JSON |
| Input | Governance metadata | models/{entity}/governance.json | JSON |
| Input | Rules definition | models/{entity}/rules.yaml | YAML |
| Output | Bronze DDL | snowflake/ddl/{entity}.sql | SQL |
| Output | dbt Bronze model + schema | dbt/models/bronze/{entity}.sql + .yml | SQL + YAML |
| Output | dbt Silver view + schema | dbt/models/silver/{entity}_silver.sql + .yml | SQL + YAML |
| Output | dbt Gold view + schema | dbt/models/gold/{entity}_gold.sql + .yml | SQL + YAML |
| Output | Semantic view | dbt/models/semantic/{entity}.semantic.yml | YAML |
External System Integration
| System | Client Trait | Auth | Interface | Access Task | Status |
|---|---|---|---|---|---|
| Snowflake | SnowflakeClient | OAuth (WAM token) | IF-04 | A05/A06 | Stub |
Technology
- Language: Rust (edition 2021)
- Async: No, synchronous (Pattern B). Pure template rendering.
- Key crates: `tera` (templates), `serde_yaml`, `chrono`
- Templates (8): `bronze_ddl.sql.tera`, `dbt_bronze.sql.tera`, `dbt_bronze.yml.tera`, `silver.sql.tera`, `silver.yml.tera`, `gold.sql.tera`, `gold.yml.tera`, `semantic.yml.tera`
Subcommands
| Command | Description |
|---|---|
| `generate` | Generate all Snowflake artifacts (optionally filter by --layer) |
| `apply` | Deploy DDL to Snowflake (planned) |
Current Status
Stub. All 8 Tera templates are present. Framework implemented with template rendering logic. StubSnowflakeClient is used for deployment. Execution pending full wiring.
4.4.8 rdt-model-policy
Phase: 4 (Deploy)
Purpose: Compile rules.yaml and policies.yaml into OPA Rego policies, generate Kubernetes deployment manifests, and produce dbt test files. Handles both real-time (OPA) and batch (dbt) DQ enforcement from a single rule source.
Parallelizable: Yes (parallel within Phase 4)
Input/Output
| Direction | Artifact | Path | Format |
|---|---|---|---|
| Input | Rules definition | models/{entity}/rules.yaml | YAML |
| Input | Policies definition | models/{entity}/policies.yaml | YAML |
| Input | Entity model | models/{entity}/model.json | JSON |
| Output | OPA Rego policies | k8s/{entity}/*.rego | Rego |
| Output | K8s Deployment | k8s/{entity}/opa-deployment.yaml | YAML |
| Output | K8s Service | k8s/{entity}/opa-service.yaml | YAML |
| Output | Bundle ConfigMap | k8s/{entity}/bundle-configmap.yaml | YAML |
| Output | Bundle refresh CronJob | k8s/{entity}/bundle-refresh-cronjob.yaml | YAML |
| Output | dbt test files | dbt/tests/{entity}/ | SQL |
External System Integration
| System | Client Trait | Auth | Interface | Access Task | Status |
|---|---|---|---|---|---|
| CaaS/Kubernetes | (pending) | Rancher token | IF-08 | A13 | Active |
| MRHub | (pending) | OAuth | IF-17 | A03 | Not started |
Technology
- Language: Rust (edition 2021)
- Async: Mixed. Sync for generation (Pattern B), async for deploy.
- Key crates: `tera` (templates), `jsonschema` (rule validation), `serde_yaml`, `tokio`, `reqwest`
- Templates (7): `rego-policy.rego.tera`, `rego-validation.rego.tera`, `opa-deployment.yaml.tera`, `opa-service.yaml.tera`, `bundle-configmap.yaml.tera`, `bundle-refresh-cronjob.yaml.tera`, plus dbt test template
- Schemas (3): `rules.schema.json`, `policies.schema.json`, `validation-response.schema.json`
Subcommands
| Command | Description |
|---|---|
| `generate` | Generate Rego policies and K8s manifests (optionally filter by --domain) |
| `deploy` | Deploy to CaaS cluster (planned) |
| `evaluate` | Local OPA policy evaluation against input data |
| `dbt` | Generate dbt test files (optionally filter by --gate) |
Current Status
Stub. All 7 templates and 3 schemas are present. CaaS access confirmed (Rancher token active). Framework implemented; execution pending full wiring.
4.4.9 rdt-model-api
Phase: 4 (Deploy)
Purpose: Generate OpenAPI 3.x specifications from entity models and publish to Mulesoft Anypoint Platform for managed API exposure.
Parallelizable: Yes (parallel within Phase 4)
Input/Output
| Direction | Artifact | Path | Format |
|---|---|---|---|
| Input | Entity model | models/{entity}/model.json | JSON |
| Input | Governance metadata | models/{entity}/governance.json | JSON |
| Output | OpenAPI spec | apis/{entity}/openapi.yaml | YAML (OpenAPI 3.x) |
External System Integration
| System | Client Trait | Auth | Interface | Access Task | Status |
|---|---|---|---|---|---|
| Mulesoft | MulesoftClient | Anypoint OAuth | IF-10 | A09 | Stub |
Technology
- Language: Rust (edition 2021)
- Async: Yes, via `#[tokio::main]` (Pattern A)
- Key crates: `tera` (templates), `tokio`, `serde`
- Templates: Planned (OpenAPI YAML template)
Subcommands
| Command | Description |
|---|---|
| `generate` | Generate OpenAPI 3.x specification |
| `publish` | Publish to Mulesoft Anypoint (planned) |
Current Status
Planned. Framework in place. No templates yet. StubMulesoftClient provides fixture responses. Blocked on A09 (Mulesoft access).
4.4.10 rdt-model-mcp
Phase: 4 (Deploy)
Purpose: Generate MCP (Model Context Protocol) tool definitions that expose Gold data products to AI agents: Cortex Analyst, Claude, and other MCP-compatible agents.
Parallelizable: Yes (parallel within Phase 4)
Input/Output
| Direction | Artifact | Path | Format |
|---|---|---|---|
| Input | Entity model | models/{entity}/model.json | JSON |
| Input | Semantic view definition | dbt/models/semantic/{entity}.semantic.yml | YAML |
| Output | MCP tool definition | apis/{entity}/mcp_tool.json | JSON |
External System Integration
| System | Client Trait | Auth | Interface | Access Task | Status |
|---|---|---|---|---|---|
| MCP Registry | McpRegistryClient | TBD | IF-11 | — | Planned |
Technology
- Language: Rust (edition 2021)
- Async: Yes, via `#[tokio::main]` (Pattern A)
- Key crates: `tera` (templates), `tokio`, `serde`
- Templates: Planned (MCP tool JSON template)
Subcommands
| Command | Description |
|---|---|
| `generate` | Generate MCP tool definition |
| `register` | Register with MCP registry (hosting TBD) |
Current Status
Planned. Framework in place. MCP hosting model under investigation (Snowflake MCP not available on company account).
4.4.11 rdt-model-sdk
Phase: 4 (Deploy)
Purpose: Generate type-safe SDK clients for programmatic data product access: a Python package and a cross-compiled Rust CLI targeting 5 platforms.
Parallelizable: Yes (parallel within Phase 4)
Input/Output
| Direction | Artifact | Path | Format |
|---|---|---|---|
| Input | Entity model | models/{entity}/model.json | JSON |
| Input | Data contract | models/{entity}/datacontract.yaml | YAML |
| Output | Python SDK | sdks/{entity}/python/ | Python package |
| Output | CLI SDK source | sdks/{entity}/cli/ | Rust source |
External System Integration
None. Pure code generation module.
Technology
- Language: Rust (edition 2021)
- Async: No, synchronous (Pattern B). Pure code generation.
- Key crates: `tera` (templates), `serde_json`, `serde_yaml`, `chrono`
- Templates: Planned (Python + Rust CLI templates)
Subcommands
| Command | Description |
|---|---|
| `python` | Generate Python SDK package |
| `cli generate` | Generate Rust CLI SDK source |
| `cli build` | Cross-compile CLI binaries (optionally --target <triple>) |
Current Status
Planned. Framework implemented. No templates yet. Pure generation module with no external dependencies.
4.4.12 rdt-model-contract
Phase: 4 (Deploy)
Purpose: Generate a datacontract.com 1.1.0 YAML specification for the entity’s Gold and Semantic views. The contract defines schema, SLA, quality expectations, and ownership in a machine-readable format.
Parallelizable: Yes (parallel within Phase 4)
Input/Output
| Direction | Artifact | Path | Format |
|---|---|---|---|
| Input | Entity model | models/{entity}/model.json | JSON |
| Input | Governance metadata | models/{entity}/governance.json | JSON |
| Input | GUPRI record | models/{entity}/gupri.yaml | YAML |
| Output | Data contract | models/{entity}/datacontract.yaml | YAML (datacontract.com 1.1.0) |
External System Integration
None. Pure template rendering module.
Technology
- Language: Rust (edition 2021)
- Async: No, synchronous (Pattern B). Pure template rendering.
- Key crates: `tera` (templates), `jsonschema` (output validation), `serde_yaml`, `chrono`
- Templates (1): `contract.yaml.tera`
- Schemas (1): `contract.schema.json` (validates generated output)
Subcommands
| Command | Description |
|---|---|
| `generate` | Generate datacontract.yaml for an entity |
Current Status
Production-ready. Fully implemented with template rendering, schema validation, GUPRI integration, and dry-run support. Unit tests verify schema compliance.
4.4.13 rdt-model-register
Phase: 5 (Register)
Purpose: Register the deployed data product across enterprise discovery and governance systems — push lineage to Collibra, register in Snowflake Horizon, and catalog in Roche Data Marketplace.
Parallelizable: Yes (parallel with gupri and search within Phase 5)
Input/Output
| Direction | Artifact | Path | Format |
|---|---|---|---|
| Input | All Phase 4 deployment results | deploy/*-result.json | JSON |
| Input | Entity model + governance | models/{entity}/ | JSON + YAML |
| Output | Registration confirmations | register/register-result.json | JSON |
External System Integration
| System | Client Trait | Auth | Interface | Access Task | Status |
|---|---|---|---|---|---|
| Collibra | CollibraClient | OAuth | IF-06 | A08 | Stub |
| Snowflake Horizon | HorizonClient | OAuth (WAM) | IF-14 | — | Stub |
| Data Marketplace | (TBD) | TBD | IF-15 | A15 | Stub |
Technology
- Language: Rust (edition 2021)
- Async: Yes, via `#[tokio::main]` (Pattern A)
- Key crates: `tokio`, `serde`
- Templates: None
Subcommands
| Command | Description |
|---|---|
| `collibra` | Push lineage records to Collibra |
| `horizon` | Register in Snowflake Horizon |
| `rdm` | Register in Roche Data Marketplace |
| `all` | Run all three registrations |
Current Status
Planned. Framework in place. All client traits defined with stubs. Blocked on A07/A08 (Collibra) and A15 (Data Marketplace).
4.4.14 rdt-model-gupri
Phase: 5 (Register)
Purpose: Register artifacts with GUPRI (Globally Unique Persistent Roche Identifier) to obtain resolvable URIs for every data product artifact.
Parallelizable: Yes (parallel within Phase 5)
Input/Output
| Direction | Artifact | Path | Format |
|---|---|---|---|
| Input | Entity ID + artifact type | CLI arguments | String |
| Output | GUPRI record | models/{entity}/gupri.yaml | YAML |
| Output | Module result envelope | stdout (when --json) | JSON |
External System Integration
| System | Client Trait | Auth | Interface | Access Task | Status |
|---|---|---|---|---|---|
| GUPRI | GupriClient | OAuth (PingFederate) | IF-12, IF-13 | A02 | Stub |
Technology
- Language: Rust (edition 2021)
- Async: Yes, via `#[tokio::main]` (Pattern A)
- Key crates: `tokio`, `reqwest`, `async-trait`, `jsonschema`, `serde_yaml`
- Schemas (1): `gupri.schema.json` (validates GUPRI records)
- Templates: None
Subcommands
| Command | Description |
|---|---|
| `register` | Register artifact and obtain GUPRI URI (--artifact-type) |
| `resolve` | Resolve an existing GUPRI URI to its record |
Current Status
Production-ready. Fully implemented with StubGupriClient, schema validation, YAML output, and dry-run support. Pending A02 for live GUPRI API integration.
4.4.15 rdt-model-search
Phase: 5 (Register)
Purpose: Push offline documentation to the Sinequa enterprise search engine for data product discovery across Roche.
Parallelizable: Yes (parallel within Phase 5)
Input/Output
| Direction | Artifact | Path | Format |
|---|---|---|---|
| Input | All generated artifacts | Various paths | Mixed |
| Output | Search index push confirmation | register/search-result.json | JSON |
External System Integration
| System | Client Trait | Auth | Interface | Access Task | Status |
|---|---|---|---|---|---|
| Sinequa | SinequaClient | TBD | IF-16 | A17 | Stub |
Technology
- Language: Rust (edition 2021)
- Async: Yes, via `#[tokio::main]` (Pattern A)
- Key crates: `tokio`, `serde`
- Templates: None
Subcommands
| Command | Description |
|---|---|
| `push` | Push documentation to Sinequa search index |
Current Status
Planned. Framework in place. StubSinequaClient defined. Integration mechanism (API vs. file drop) TBD. Blocked on A17.
4.4.16 rdt-model-docs
Phase: 6 (Support)
Purpose: Generate Starlight reference documentation from all pipeline artifacts — clap definitions, JSON Schemas, ADRs, contracts, and API specs. The docs site is a build artifact that cannot drift from implementation.
Parallelizable: Yes (parallel with cidb and event within Phase 6)
Input/Output
| Direction | Artifact | Path | Format |
|---|---|---|---|
| Input | All Phase 4 artifacts | Various | Mixed (SQL, YAML, JSON, Rego) |
| Input | ADR files | adr/ | Markdown |
| Input | CLI definitions | Embedded in binaries | Clap metadata |
| Output | Reference docs | docs/src/content/docs/reference/ | Markdown |
| Output | Architecture docs | docs/src/content/docs/architecture/ | Markdown |
| Output | Status docs | docs/src/content/docs/status/ | Markdown |
External System Integration
None. Pure generation from local files.
Technology
- Language: Rust (edition 2021)
- Async: No, synchronous (Pattern B). Pure file reading and template rendering.
- Key crates: `serde_json`, `serde`
- Templates: Planned (Markdown templates for each doc type)
Subcommands
| Command | Description |
|---|---|
| `generate` | Generate all Starlight reference documentation |
Current Status
Planned. Framework in place. Generation logic not yet implemented. Output directories (reference/, architecture/, status/) are exclusively owned by this module and must not be edited manually.
4.4.17 rdt-model-cidb
Phase: 6 (Support)
Purpose: Create ServiceNow change management records (CIDM) for production deployments. Provides audit trail for change control compliance.
Parallelizable: Yes (parallel within Phase 6)
Input/Output
| Direction | Artifact | Path | Format |
|---|---|---|---|
| Input | Deployment results | Phase 4–5 result envelopes | JSON |
| Input | Entity metadata | models/{entity}/ | JSON + YAML |
| Output | Change request record | support/cidb-result.json | JSON |
External System Integration
| System | Client Trait | Auth | Interface | Access Task | Status |
|---|---|---|---|---|---|
| ServiceNow | ServicenowClient | OAuth | IF-20 | A12 | Stub |
Technology
- Language: Rust (edition 2021)
- Async: Yes, via `#[tokio::main]` (Pattern A)
- Key crates: `tokio`, `serde`
- Templates: None
Subcommands
| Command | Description |
|---|---|
| `register` | Create ServiceNow change request for deployment |
Current Status
Planned. Framework in place. StubServicenowClient defined. Blocked on A12 (ServiceNow access).
4.4.18 rdt-model-event
Phase: 6 (Support)
Purpose: Publish data product lifecycle events to the Solace enterprise event bus, notifying downstream systems and consumers of new, updated, or deprecated data products.
Parallelizable: Yes (parallel within Phase 6)
Input/Output
| Direction | Artifact | Path | Format |
|---|---|---|---|
| Input | Entity model | models/{entity}/model.json | JSON |
| Input | GUPRI record | models/{entity}/gupri.yaml | YAML |
| Output | Event publication confirmation | support/event-result.json | JSON |
External System Integration
| System | Client Trait | Auth | Interface | Access Task | Status |
|---|---|---|---|---|---|
| Solace | SolaceClient | Token | IF-19 | A04 | Stub |
Technology
- Language: Rust (edition 2021)
- Async: Yes, via `#[tokio::main]` (Pattern A)
- Key crates: `tokio`, `chrono`, `serde`, `serde_json`, `serde_yaml`, `jsonschema`
- Schemas (1): `event.schema.json` (validates event payloads)
- Event types: `Created`, `Updated`, `Verified`, `SupersededBy`
- Topic pattern: `rdt/data-product/{entity_id}/{event_type}`
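The event types and topic pattern listed above can be combined into a small sketch. The enum mirrors the four documented event types; the lowercase snake_case spelling of the `{event_type}` segment and the helper name `topic_for` are assumptions, since the document only shows `created` as the CLI default.

```rust
/// Lifecycle event types as listed for this module.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum EventType {
    Created,
    Updated,
    Verified,
    SupersededBy,
}

impl EventType {
    /// Topic segment for the event type. The casing is an assumption.
    fn as_segment(self) -> &'static str {
        match self {
            EventType::Created => "created",
            EventType::Updated => "updated",
            EventType::Verified => "verified",
            EventType::SupersededBy => "superseded_by",
        }
    }
}

/// Build a topic following `rdt/data-product/{entity_id}/{event_type}`.
fn topic_for(entity_id: &str, event: EventType) -> String {
    format!("rdt/data-product/{entity_id}/{}", event.as_segment())
}
```

Keeping the entity ID ahead of the event type in the topic lets subscribers filter on one entity across all of its lifecycle events with a single wildcard subscription.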
Subcommands
| Command | Description |
|---|---|
| `publish` | Publish lifecycle event (optionally --event-type, default: created) |
Current Status
Production-ready. Fully implemented with client certificate authentication (PEM), schema-validated event payloads, topic-based routing, and dry-run support. A04 resolved 2026-05-07. Additionally, all CLI modules now publish automatic execution events to Solace via rdt-model-common/events.rs.
4.4.19 rdt-model-common (Shared Library)
rdt-model-common is the [lib] member of the Cargo workspace. Every rdt-model-* binary depends on it. It provides no CLI interface, only shared types, traits, and utilities.
Architecture
rdt-model-common/
├── src/
│   ├── lib.rs               ← module exports
│   ├── cli.rs               ← GlobalOpts: --target, --entity, --dry-run, --quiet, --json, --verbose
│   ├── config.rs            ← Config: roche-data.toml + env var overrides + environment resolution
│   ├── paths.rs             ← All output path functions (one per artifact type)
│   ├── errors.rs            ← CliError enum with exit code mapping
│   ├── exit_codes.rs        ← Standard exit codes per ADR 0008
│   ├── fs.rs                ← write_artifact() + write_json_artifact() → OutputAction
│   ├── reporting.rs         ← init_tracing(), ModuleResultBuilder, ModuleResult, OutputAction
│   ├── run_id.rs            ← UUIDv7 run correlation IDs
│   ├── models/
│   │   └── mod.rs           ← Entity, GupriRecord, CollibraMetadata, RulesDefinition, etc.
│   ├── clients/
│   │   ├── mod.rs           ← re-exports all client traits
│   │   ├── rtis.rs          ← RTisClient trait + StubRTisClient
│   │   ├── collibra.rs      ← CollibraClient trait + HttpCollibraClient + StubCollibraClient
│   │   ├── gupri.rs         ← GupriClient trait + StubGupriClient
│   │   ├── snowflake.rs     ← SnowflakeClient trait + StubSnowflakeClient
│   │   ├── snowflake_auth.rs ← SnowflakeAuth (OAuth WAM token exchange)
│   │   ├── postgres_auth.rs ← PostgresAuth (Vault credential retrieval)
│   │   ├── horizon.rs       ← HorizonClient trait + StubHorizonClient
│   │   ├── solace.rs        ← SolaceClient trait + StubSolaceClient
│   │   ├── mulesoft.rs      ← MulesoftClient trait + StubMulesoftClient
│   │   ├── mcp_registry.rs  ← McpRegistryClient trait + StubMcpRegistryClient
│   │   ├── servicenow.rs    ← ServicenowClient trait + StubServicenowClient
│   │   ├── sinequa.rs       ← SinequaClient trait + StubSinequaClient
│   │   ├── rdm.rs           ← RdmClient trait + StubRdmClient
│   │   └── llm.rs           ← LlmClient trait + StubLlmClient
│   └── json/
│       ├── mod.rs           ← re-exports
│       ├── handler.rs       ← JsonHandler: simd-json parse, jsonschema validate, BufWriter write
│       ├── schema_cache.rs  ← Lazy OnceLock-based compiled schema cache
│       └── errors.rs        ← JsonValidationError enum
└── schemas/
    ├── manifest.json        ← Shared base manifest schema
    └── result.json          ← Shared result envelope schema
Key Subsystems
Client Trait Pattern
Every external system has a Rust trait and two implementations: a real HTTP client and a stub. The binary layer receives &dyn SystemClient via dependency injection, enabling transparent switching between production and dry-run mode.
graph LR
  Binary["**Binary Module**<br/>e.g., rdt-model-pull"]
  Trait["**Trait (async)**<br/>e.g., RTisClient<br/>get_entity()<br/>list_entities()"]
  Http["**HttpRTisClient**<br/>(live HTTP)"]
  Stub["**StubRTisClient**<br/>(fixture data)"]
  Binary --> Trait
  Http -.->|implements| Trait
  Stub -.->|implements| Trait
Client selection logic: if --dry-run is set or credentials are missing, the module uses the stub client. Otherwise, it instantiates the real HTTP client. This is a configuration decision, not a code change.
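A minimal sketch of the client trait pattern and the selection rule follows. To stay self-contained it uses a synchronous trait instead of the async one described here, and the `Credentials` struct, the `name()` method, the fixture payload, and the `select_client` helper are hypothetical names introduced for illustration.

```rust
/// Simplified, synchronous stand-in for the async `RTisClient` trait.
trait RTisClient {
    fn get_entity(&self, entity_id: &str) -> Result<String, String>;
    /// Illustrative helper so callers can see which implementation was chosen.
    fn name(&self) -> &'static str;
}

/// Stub implementation returning fixture data (payload shape is made up).
struct StubRTisClient;

impl RTisClient for StubRTisClient {
    fn get_entity(&self, entity_id: &str) -> Result<String, String> {
        Ok(format!("{{\"id\":\"{entity_id}\",\"source\":\"fixture\"}}"))
    }
    fn name(&self) -> &'static str { "stub" }
}

/// Hypothetical credentials needed by the live client.
struct Credentials { base_url: String, token: String }

/// Live client placeholder; the HTTP call itself is out of scope for this sketch.
struct HttpRTisClient { creds: Credentials }

impl RTisClient for HttpRTisClient {
    fn get_entity(&self, entity_id: &str) -> Result<String, String> {
        Err(format!(
            "live call to {}/{} not implemented in this sketch (token length {})",
            self.creds.base_url, entity_id, self.creds.token.len()
        ))
    }
    fn name(&self) -> &'static str { "http" }
}

/// Selection rule from the text: stub when --dry-run is set or credentials are missing.
fn select_client(dry_run: bool, creds: Option<Credentials>) -> Box<dyn RTisClient> {
    match creds {
        Some(creds) if !dry_run => Box::new(HttpRTisClient { creds }),
        _ => Box::new(StubRTisClient),
    }
}

/// Command code depends only on the trait object, never on a concrete client.
fn pull(client: &dyn RTisClient, entity_id: &str) -> Result<String, String> {
    client.get_entity(entity_id)
}
```

Because `pull` only sees `&dyn RTisClient`, the same command code runs unchanged in dry-run mode, in tests, and against the live system.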
JSON Handling (json/)
Centralised JSON processing with three guarantees:
- Parsing: `simd_json` for 2–4x faster throughput with SIMD acceleration
- Validation: Three-layer approach: `jsonschema` at entry, serde types for structure, `garde` for business logic
- Writing: Direct-to-file via `BufWriter` (no full string allocation)
The JsonHandler facade provides parse() for internal/trusted data and parse_validated() for external input that requires schema validation. A lazy OnceLock-based schema cache compiles schemas once and reuses them.
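The compile-once behaviour of the schema cache can be sketched with `std::sync::OnceLock`. This is an illustration under assumptions: `CompiledSchema` and `compile` are stand-ins for a real `jsonschema` validator and its construction, and the comma-separated "schema source" is a toy format, not JSON Schema.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, OnceLock};

/// Stand-in for a compiled JSON Schema validator (hypothetical type).
struct CompiledSchema {
    name: String,
    required: Vec<String>,
}

/// Toy "compilation": treat the source as a comma-separated list of required keys.
fn compile(name: &str, source: &str) -> CompiledSchema {
    CompiledSchema {
        name: name.to_string(),
        required: source.split(',').map(|s| s.trim().to_string()).collect(),
    }
}

/// Process-wide cache, created lazily on first access.
static CACHE: OnceLock<Mutex<HashMap<String, Arc<CompiledSchema>>>> = OnceLock::new();

/// Return the cached schema for `name`, compiling it only on the first request.
fn cached_schema(name: &str, source: &str) -> Arc<CompiledSchema> {
    let cache = CACHE.get_or_init(|| Mutex::new(HashMap::new()));
    let mut guard = cache.lock().expect("schema cache poisoned");
    guard
        .entry(name.to_string())
        .or_insert_with(|| Arc::new(compile(name, source)))
        .clone()
}
```

Returning an `Arc` lets callers hold the compiled schema without keeping the cache lock, so validation itself never serialises on the mutex.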
Path Management (paths.rs)
One function per artifact type. All output paths are constructed here — never inline in commands or generators. Adding a new artifact type means adding one function.
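A sketch of the one-function-per-artifact convention is shown below. The relative paths are taken from the module Input/Output tables in this section; the function names and the explicit `root` parameter are assumptions about how `paths.rs` might be shaped.

```rust
use std::path::{Path, PathBuf};

/// models/{entity}/model.json
fn model_json(root: &Path, entity: &str) -> PathBuf {
    root.join("models").join(entity).join("model.json")
}

/// models/{entity}/governance.json
fn governance_json(root: &Path, entity: &str) -> PathBuf {
    root.join("models").join(entity).join("governance.json")
}

/// snowflake/ddl/{entity}.sql
fn bronze_ddl(root: &Path, entity: &str) -> PathBuf {
    root.join("snowflake").join("ddl").join(format!("{entity}.sql"))
}

/// apis/{entity}/openapi.yaml
fn openapi_spec(root: &Path, entity: &str) -> PathBuf {
    root.join("apis").join(entity).join("openapi.yaml")
}
```

With this layout, a path change such as moving DDL files touches exactly one function, and every generator and validator picks it up automatically.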
Reporting (reporting.rs)
Two-track system used by every binary:
| Track | Target | Purpose |
|---|---|---|
| Structured tracing | stderr | Human-readable progress (info!, debug!, warn!) |
| Result envelope | stdout | Machine-readable JSON for orchestrator integration |
init_reporting() is called first in every main(). Verbosity is controlled by --verbose / --quiet / RUST_LOG.
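The split between the two tracks can be sketched with plain writers. This is not the actual reporting.rs API: the `ModuleResult` fields, the envelope layout, and the `report` function are hypothetical, and the JSON is rendered by hand only to keep the example free of external crates (the `{:?}` escaping is adequate for simple ASCII strings, not for general JSON).

```rust
use std::io::{self, Write};

/// Hypothetical result envelope; the real one follows schemas/result.json.
pub struct ModuleResult {
    pub module: String,
    pub success: bool,
    pub artifacts: Vec<String>,
}

impl ModuleResult {
    /// Renders a minimal JSON envelope by hand (a real implementation would use serde).
    pub fn to_json(&self) -> String {
        let artifacts: Vec<String> = self.artifacts.iter().map(|a| format!("{a:?}")).collect();
        format!(
            "{{\"module\":{:?},\"success\":{},\"artifacts\":[{}]}}",
            self.module,
            self.success,
            artifacts.join(",")
        )
    }
}

/// Progress goes to `err` (stderr track); the envelope goes to `out` only in --json mode.
pub fn report<O: Write, E: Write>(
    out: &mut O,
    err: &mut E,
    result: &ModuleResult,
    json: bool,
) -> io::Result<()> {
    writeln!(err, "INFO {}: {} artifact(s) processed", result.module, result.artifacts.len())?;
    if json {
        writeln!(out, "{}", result.to_json())?;
    }
    Ok(())
}
```

In a binary, `out` and `err` would be `io::stdout()` and `io::stderr()`; keeping stdout free of progress text is what lets an orchestrator parse it directly.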
Filesystem Helpers (fs.rs)
write_artifact() and write_json_artifact() handle file output. They return OutputAction (Wrote, Skipped, Updated) which feeds into the result envelope. They respect --dry-run mode and use tracing for progress reporting.
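A rough sketch of such a helper is shown below. The exact semantics of the three `OutputAction` variants are not spelled out in this document, so the sketch assumes: `Skipped` for dry-run or unchanged content, `Updated` when an existing file is replaced, and `Wrote` for a new file. The signature is likewise an assumption.

```rust
use std::fs;
use std::io;
use std::path::Path;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OutputAction {
    Wrote,
    Skipped,
    Updated,
}

/// Writes `content` to `path` unless in dry-run mode or the file is already identical.
pub fn write_artifact(path: &Path, content: &str, dry_run: bool) -> io::Result<OutputAction> {
    // Read the current content, treating a missing file as "nothing there yet".
    let existing = match fs::read_to_string(path) {
        Ok(text) => Some(text),
        Err(e) if e.kind() == io::ErrorKind::NotFound => None,
        Err(e) => return Err(e),
    };

    // Assumed rule: dry-run and identical content both count as Skipped.
    if dry_run || existing.as_deref() == Some(content) {
        return Ok(OutputAction::Skipped);
    }

    if let Some(parent) = path.parent() {
        fs::create_dir_all(parent)?;
    }
    fs::write(path, content)?;

    Ok(if existing.is_some() {
        OutputAction::Updated
    } else {
        OutputAction::Wrote
    })
}
```

Returning the action instead of logging it inside the helper is what allows the caller to feed it into the result envelope.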
4.4.20 Streamlit UI Applications
Two Streamlit in Snowflake applications provide consumer-facing interfaces outside the CLI pipeline. They are Python applications deployed directly into Snowflake’s Streamlit hosting environment.
rdt-ui-crud (Entity CRUD)
Purpose: Render a data entry form from a crud.json artifact, enabling create/read/update/delete operations against an entity’s Snowflake tables.
| Attribute | Value |
|---|---|
| Directory | ui/crud/ |
| Entry point | app.py |
| Input | models/{entity}/crud.json (generated by pipeline) |
| Dependencies | streamlit, snowflake-snowpark-python |
| Schema | schemas/crud.schema.json |
| Deployed to | Streamlit in Snowflake |
| Status | Scaffold (form rendering from spec; CRUD operations not yet wired) |
Data flow: The CLI pipeline generates crud.json → Streamlit app reads the spec → renders dynamic form → executes SQL against Snowflake session.
rdt-ui-ratification (Steward Ratification)
Purpose: Enable data stewards to review and approve/change taxonomies, synonyms, definitions, and data tags (PII, classification, usage restrictions) for an entity.
| Attribute | Value |
|---|---|
| Directory | ui/ratification/ |
| Entry point | app.py |
| Input | models/{entity}/model.json + models/{entity}/governance.json |
| Dependencies | streamlit, snowflake-snowpark-python |
| Deployed to | Streamlit in Snowflake |
| Status | Scaffold (taxonomy display; approval workflow not yet wired) |
Data flow: Steward opens app → reviews entity metadata (taxonomies, definitions, governance) → approves or requests changes → changes feed back into the pipeline as governance updates.
4.4.21 Cross-Cutting Architectural Patterns
These patterns are enforced across all 18 binary modules.
Pattern 1: Generator Purity
Generators are pure functions that take a resolved Model and return Result<String>. No filesystem access, no HTTP calls, no side effects. The command layer handles input loading and output writing. Generators live in the module that owns the artifact, never in rdt-model-compile.

    use anyhow::{Context, Result};

    /// Pure generator: renders the data contract from resolved inputs, with no I/O.
    pub fn generate_datacontract(
        model: &Model,
        gupri: &GupriRecord,
        governance: &CollibraMetadata,
    ) -> Result<String> {
        // The template is embedded at compile time, so no file is read at runtime.
        let tmpl = include_str!("../templates/contract.yaml.tera");
        let mut ctx = tera::Context::new();
        ctx.insert("model", model);
        ctx.insert("gupri", gupri);
        ctx.insert("governance", governance);
        // One-off render with autoescaping disabled.
        tera::Tera::one_off(tmpl, &ctx, false)
            .context("failed to render data contract template")
    }

Pattern 2: Embedded Templates
All Tera templates are embedded at compile time using include_str!. The binary is fully self-contained, with no runtime template file access. Template changes require recompilation, which triggers CI validation.
Pattern 3: Mandatory Reporting
Every binary calls cli.global.init_reporting() as the first action in main(). All progress uses tracing macros (never println!). When --json is passed, a machine-readable result envelope is emitted to stdout for orchestrator consumption.
Pattern 4: Environment Targeting
Every rdt-model-* command requires --target dev|test|prod. There is no default. The target drives:
- Snowflake schema prefix (DEV_BRONZE, TEST_BRONZE, PROD_BRONZE)
- Kubernetes namespace (rdt-model-dev, rdt-model-test, rdt-model-prod)
- Vault secret path (secret/dev/ci, secret/test/ci, secret/prod/ci)
- dbt target profile
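The mapping from target to derived names is mechanical, which a small enum makes explicit. The sketch below is an assumption about how this could be modelled; the `Target` type and its method names are hypothetical, while the produced strings follow the naming listed above.

```rust
use std::str::FromStr;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Target {
    Dev,
    Test,
    Prod,
}

impl FromStr for Target {
    type Err = String;

    /// No default: anything other than dev|test|prod is rejected.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "dev" => Ok(Target::Dev),
            "test" => Ok(Target::Test),
            "prod" => Ok(Target::Prod),
            other => Err(format!("unknown target '{other}', expected dev|test|prod")),
        }
    }
}

impl Target {
    pub fn as_str(self) -> &'static str {
        match self {
            Target::Dev => "dev",
            Target::Test => "test",
            Target::Prod => "prod",
        }
    }

    /// Snowflake schema for a medallion layer, e.g. DEV_BRONZE.
    pub fn snowflake_schema(self, layer: &str) -> String {
        format!("{}_{}", self.as_str().to_uppercase(), layer.to_uppercase())
    }

    /// Kubernetes namespace, e.g. rdt-model-dev.
    pub fn k8s_namespace(self) -> String {
        format!("rdt-model-{}", self.as_str())
    }

    /// Vault secret path, e.g. secret/dev/ci.
    pub fn vault_path(self) -> String {
        format!("secret/{}/ci", self.as_str())
    }

    /// dbt target profile name.
    pub fn dbt_target(self) -> &'static str {
        self.as_str()
    }
}
```

Parsing into an enum at the CLI boundary means an unsupported value such as `staging` fails before any schema, namespace, or secret path is derived from it.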
Pattern 5: Error Handling
anyhow::Result throughout. Every ? has .context(...) for error chain clarity. No .unwrap() or .expect() outside #[cfg(test)]. Exit codes are standardised: 0 (success), 1 (runtime error), 2 (validation error), 3 (config error).
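A minimal sketch of the exit code mapping is given below. The variant names and the `exit_code_for` helper are assumptions; only the numeric codes come from the text above, and the real CliError in errors.rs may carry more detail.

```rust
use std::fmt;

/// Hypothetical error categories matching the standard exit codes.
#[derive(Debug)]
pub enum CliError {
    Runtime(String),
    Validation(String),
    Config(String),
}

impl CliError {
    /// 1 = runtime error, 2 = validation error, 3 = config error.
    pub fn exit_code(&self) -> i32 {
        match self {
            CliError::Runtime(_) => 1,
            CliError::Validation(_) => 2,
            CliError::Config(_) => 3,
        }
    }
}

impl fmt::Display for CliError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            CliError::Runtime(msg) => write!(f, "runtime error: {msg}"),
            CliError::Validation(msg) => write!(f, "validation error: {msg}"),
            CliError::Config(msg) => write!(f, "config error: {msg}"),
        }
    }
}

impl std::error::Error for CliError {}

/// Maps a command outcome to the process exit code (0 on success).
pub fn exit_code_for(result: &Result<(), CliError>) -> i32 {
    match result {
        Ok(()) => 0,
        Err(e) => e.exit_code(),
    }
}
```

A `main()` following this pattern would end with something like `std::process::exit(exit_code_for(&run()))`, where `run` is the command body.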
Pattern 6: Stub-First Development
Every external system integration follows the stub-first pattern. Modules implement the full interface using StubClient implementations that return fixture data. Switching to live is a configuration change (credentials present + --dry-run not set), not a code change. This keeps the full pipeline runnable without any credentials.
4.5 Physical Architecture
4.5.1 Infrastructure Overview
All environments share the same physical infrastructure. Separation is achieved through configuration (schema prefixes, namespaces, and Vault paths), not through separate systems. See ADR 0010 for full rationale.

    graph TD
        subgraph Snowflake["**Snowflake (Cloud)**<br/>Account: roche-gsn | Database: RDT_MODEL"]
            subgraph SFDev["DEV"]
                DEV_B["DEV_BRONZE"]
                DEV_S["DEV_SILVER"]
                DEV_G["DEV_GOLD"]
                DEV_SE["DEV_SEMANTIC"]
            end
            subgraph SFTest["TEST"]
                TEST_B["TEST_BRONZE"]
                TEST_S["TEST_SILVER"]
                TEST_G["TEST_GOLD"]
                TEST_SE["TEST_SEMANTIC"]
            end
            subgraph SFProd["PROD"]
                PROD_B["PROD_BRONZE"]
                PROD_S["PROD_SILVER"]
                PROD_G["PROD_GOLD"]
                PROD_SE["PROD_SEMANTIC"]
            end
            SiS["Streamlit: rdt-ui-crud, rdt-ui-ratification"]
            Cortex["Cortex Analyst"]
        end

        subgraph K8s["**CaaS / Kubernetes (Rancher)**<br/>Cluster: Cloud Prod eu-central-1 | Project: rdt_model"]
            K8sDev["ns: rdt-model-dev<br/>OPA pods + CronJobs"]
            K8sTest["ns: rdt-model-test<br/>OPA pods + CronJobs"]
            K8sProd["ns: rdt-model-prod<br/>OPA pods + CronJobs"]
        end

        subgraph VaultSvc["**HashiCorp Vault**<br/>Auth: OIDC + AppRole | KV v2"]
            VDev["secret/dev/ci/"]
            VTest["secret/test/ci/"]
            VProd["secret/prod/ci/"]
            VCommon["secret/common/caas"]
        end

        subgraph GHA["**GitHub Actions (CI/CD)**"]
            Workflows["validate.yml, deploy.yml, docs.yml"]
            Envs["Environments: dev, test, prod"]
            Runners["Runners: Roche VPN-connected"]
        end

        subgraph Ping["**PingFederate / WAM (Identity)**"]
            OAuth["OAuth 2.0 client_credentials"]
            WAM["Snowflake WAM integration"]
        end

4.5.2 Network Topology

    graph TD
        subgraph Internal["**ROCHE INTERNAL NETWORK**"]
            Dev["Developer Workstation<br/>rdt-model-* CLI"]
            GHA["GitHub Actions Runner<br/>rdt-model-* CI (VPN)"]
            subgraph VPN["**Roche VPN / Corporate Network**"]
                RTiS2["RTiS (AWS+VPN)"]
                GUPRI2["GUPRI (AWS+VPN)"]
                MRHub2["MRHub (AWS+VPN)"]
                Vault2["Vault (internal)"]
                CaaS2["CaaS/Rancher"]
                Ping2["PingFederate"]
                SN2["ServiceNow"]
                Art2["Artifactory"]
            end
            Dev --> VPN
            GHA --> VPN
        end

        subgraph Cloud["**CLOUD / EXTERNAL**"]
            SF2["Snowflake (HTTPS)"]
            Bedrock2["AWS Bedrock (Claude)"]
            Mule2["Mulesoft Anypoint"]
            Sol2["Solace (Event Bus)"]
        end

        VPN -->|HTTPS| Cloud

Key network constraints:
| System | Network Zone | Access Method |
|---|---|---|
| RTiS, GUPRI, MRHub | AWS behind Roche VPN | HTTPS from VPN-connected clients |
| Vault, CaaS, PingFederate | Roche internal | Direct internal HTTPS |
| Snowflake | Cloud (public endpoint) | HTTPS with OAuth (WAM/PingFederate) |
| AWS Bedrock | AWS Cloud | HTTPS with IAM SigV4 |
| Mulesoft, Solace | Cloud/Hybrid | HTTPS with OAuth |
| GitHub Actions | Cloud runners + VPN | Self-hosted runners on Roche VPN |
4.5.3 Environment Strategy
All environments share identical infrastructure. Isolation is achieved through configuration at the following layers:
| Layer | Dev | Test | Prod |
|---|---|---|---|
| Snowflake schemas | DEV_BRONZE, DEV_SILVER, DEV_GOLD, DEV_SEMANTIC | TEST_BRONZE, TEST_SILVER, TEST_GOLD, TEST_SEMANTIC | PROD_BRONZE, PROD_SILVER, PROD_GOLD, PROD_SEMANTIC |
| K8s namespace | rdt-model-dev | rdt-model-test | rdt-model-prod |
| Vault path | secret/dev/ci/ | secret/test/ci/ | secret/prod/ci/ |
| dbt target | dev | test | prod |
| GitHub Environment | dev (auto-deploy) | test (manual approval) | prod (reviewer approval) |
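The per-environment values above are selected through layered configuration: the base roche-data.toml, then the [environments.{target}] table, then environment variables. A minimal sketch of that precedence follows, assuming flat string keys and a hypothetical `RDT_MODEL_` prefix for environment variables; neither assumption is confirmed by this document, and the real Config type is likely richer.

```rust
use std::collections::HashMap;

/// One configuration layer as flat key/value pairs (a simplification).
pub type Layer = HashMap<String, String>;

/// Merges the three layers; later layers override earlier ones.
pub fn resolve(base: &Layer, env_overrides: &Layer, env_vars: &Layer) -> Layer {
    let mut resolved = base.clone();
    resolved.extend(env_overrides.iter().map(|(k, v)| (k.clone(), v.clone())));
    resolved.extend(env_vars.iter().map(|(k, v)| (k.clone(), v.clone())));
    resolved
}

/// Builds the env-var layer from variables carrying the assumed RDT_MODEL_ prefix,
/// e.g. RDT_MODEL_WAREHOUSE becomes the key "warehouse".
pub fn env_var_layer<I>(vars: I) -> Layer
where
    I: IntoIterator<Item = (String, String)>,
{
    vars.into_iter()
        .filter_map(|(k, v)| {
            k.strip_prefix("RDT_MODEL_")
                .map(|key| (key.to_lowercase(), v))
        })
        .collect()
}
```

In a binary the last layer would come from `env_var_layer(std::env::vars())`; passing the variables in as an iterator keeps the merge logic testable without touching the process environment.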
Config resolution:

    base roche-data.toml → [environments.{target}] overrides → env var overrides

CI/CD promotion flow:

    graph LR
        Push["Push to main"] --> DEV["Deploy to DEV<br/>(auto)"]
        DEV --> TEST["Deploy to TEST<br/>(manual approval)"]
        TEST --> PROD["Deploy to PROD<br/>(reviewer approval)"]

4.5.4 Container and Image Strategy
OPA policy containers are built and deployed to CaaS Kubernetes:
| Component | Registry | Image | Deployment |
|---|---|---|---|
| OPA sidecar | Roche Artifactory | artifactory.roche.com/rdt-model/opa:{version} | K8s Deployment |
| Bundle refresh | Roche Artifactory | artifactory.roche.com/rdt-model/bundle-refresh:{version} | K8s CronJob |
Images are built in GitHub Actions, pushed to Artifactory, and deployed via generated Kubernetes manifests. Each entity gets its own OPA deployment with entity-specific Rego bundles.
4.5.5 Data Storage Architecture

    graph TD
        subgraph SF["**SNOWFLAKE — RDT_MODEL Database**"]
            subgraph Bronze["{ENV}_BRONZE (physical — append-only)"]
                BWT["waste_tracking"]
                BSE["site_energy"]
                BVQ["vendor_quality"]
            end
            subgraph Silver["{ENV}_SILVER (views — G2 validity)"]
                SWT["waste_tracking_silver"]
                SSE["site_energy_silver"]
                SVQ["vendor_quality_silver"]
            end
            subgraph Gold["{ENV}_GOLD (views — G3 business rules)"]
                GWT["waste_tracking_gold"]
                GSE["site_energy_gold"]
                GVQ["vendor_quality_gold"]
            end
            subgraph Semantic["{ENV}_SEMANTIC (views — Cortex Analyst)"]
                SMWT["waste_tracking_semantic"]
                SMSE["site_energy_semantic"]
                SMVQ["vendor_quality_semantic"]
            end
            Audit["AUDIT (cross-env, append-only)<br/>pipeline_audit_log"]
        end

        Bronze --> Silver --> Gold --> Semantic

Key design decisions (from ADR 0004):
- Bronze is the only physical write. Silver, Gold, and Semantic are views.
- Views eliminate schema migration at Silver/Gold/Semantic layers.
- DQ gates run at query time (view predicates), not at write time.
- Snowflake result cache and micro-partition pruning handle view performance.
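To make the "views with predicates" idea concrete, the sketch below renders Silver and Gold view DDL as strings in the generator style used elsewhere in this document. The function names are hypothetical, the predicates are placeholders supplied by the caller, and the inputs are assumed to be validated SQL identifiers already; the actual templates owned by rdt-model-store may look quite different.

```rust
/// Renders a Silver view over the Bronze table; the G2 validity predicate
/// is evaluated at query time because it lives in the view definition.
pub fn silver_view_sql(env: &str, entity: &str, g2_predicate: &str) -> String {
    let env = env.to_uppercase();
    format!(
        "CREATE OR REPLACE VIEW {env}_SILVER.{entity}_silver AS\n\
         SELECT *\n\
         FROM {env}_BRONZE.{entity}\n\
         WHERE {g2_predicate};"
    )
}

/// Renders a Gold view over the Silver view, adding a G3 business-rule predicate.
pub fn gold_view_sql(env: &str, entity: &str, g3_predicate: &str) -> String {
    let env = env.to_uppercase();
    format!(
        "CREATE OR REPLACE VIEW {env}_GOLD.{entity}_gold AS\n\
         SELECT *\n\
         FROM {env}_SILVER.{entity}_silver\n\
         WHERE {g3_predicate};"
    )
}
```

Since only Bronze is physically written, changing a rule means re-creating a view rather than migrating stored data.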
4.6 Non-Functional Requirements
4.6.1 User Profiles
| Profile | Description | Scale | Primary Interaction |
|---|---|---|---|
| Data Engineer | Roche domain data engineers who define entities, author rules, and run the pipeline. Power CLI users. | 5–15 across all domains (Phase 0–2), scaling to 50+ (Phase 5) | CLI + Git |
| Data Steward | Governance professionals maintaining metadata in Collibra. Non-technical, use Streamlit UI for ratification. | 3–10 per domain | Streamlit UI + Collibra |
| Platform Admin | Team maintaining the CLI codebase, templates, CI/CD, and infrastructure. | 2–5 | Rust development + GitHub |
| Domain Expert | Business analysts refining Gold rules and Semantic definitions via PR review. | 10–30 per domain | PR review + YAML authoring |
| Consumer (Human) | Analysts and scientists querying data products via SQL, SDK, or Cortex Analyst. | 100–1000+ per domain | SQL + SDK + NLQ |
| Consumer (AI Agent) | AI agents accessing data products via MCP tools. | Unbounded | MCP protocol |
| CI/CD Pipeline | GitHub Actions workflows running the pipeline on every merge. | Concurrent per entity × environment | CLI (--json mode) |
4.6.2 Performance Requirements
CLI Execution
| Operation | Target | Constraint |
|---|---|---|
| Full pipeline (compile run) | < 5 minutes per entity | Includes all 18 modules, stub mode |
| Single module (template rendering) | < 10 seconds | Pure Tera rendering, no network |
| Single module (API call + render) | < 30 seconds | Includes HTTP call + template rendering |
| Artifact validation (validate all) | < 15 seconds per entity | Schema validation of 20+ artifacts |
| Profile discovery | < 60 seconds per table | Database metadata extraction |
Snowflake Query Performance
| Query Pattern | Target | Mechanism |
|---|---|---|
| Gold view — single entity KPI | < 5 seconds | Snowflake result cache + micro-partition pruning |
| Semantic view — Cortex Analyst query | < 10 seconds | NLQ → SQL → view chain |
| Silver view — full entity scan | < 30 seconds | Columnar scan, partition pruning on date |
| Bronze table — historical backfill query | < 60 seconds | Clustering key on reporting_date |
CI/CD Pipeline
| Stage | Target |
|---|---|
| PR validation (compile + validate) | < 3 minutes |
| Full deployment (dev) | < 10 minutes |
| Promotion (test → prod) | < 5 minutes (after approval) |
4.6.3 Capacity Requirements
Entity Scaling
| Phase | Entity Count | Domains | Concurrent Pipelines |
|---|---|---|---|
| Phase 0–1 (current) | 3–5 entities | 1 (Global Sites Network) | 1 |
| Phase 2–3 | 20–50 entities | 2–3 domains | 5 |
| Phase 4–5 | 100–500 entities | 10+ domains | 20 |
Artifact Storage
| Storage | Growth Model | Retention |
|---|---|---|
| Git repository (artifacts) | ~500 KB per entity (20+ files) | Indefinite (git history) |
| Snowflake Bronze tables | Append-only, entity-dependent | Time-travel + retention policy (TBD) |
| OPA bundles (K8s ConfigMaps) | ~10 KB per entity | Current version only |
| Docker images (Artifactory) | ~50 MB per OPA image version | Last 5 versions |
Snowflake Compute
| Environment | Warehouse Size | Auto-suspend | Usage Pattern |
|---|---|---|---|
| Dev | X-Small | 60 seconds | Interactive development |
| Test | Small | 120 seconds | CI/CD validation runs |
| Prod | Medium | 300 seconds | Scheduled pipeline + analyst queries |
4.6.4 Business Continuity
4.6.4.1 Availability
| Component | Availability Target | Mechanism |
|---|---|---|
| Snowflake (query) | 99.9% (platform SLA) | Snowflake managed HA, multi-AZ |
| CaaS/K8s (OPA) | 99.5% | Replica count ≥ 2 for prod, health checks |
| GitHub Actions (CI/CD) | 99.9% (platform SLA) | GitHub managed |
| Vault (secrets) | 99.9% | Vault HA cluster (Roche managed) |
| Pipeline execution | Best-effort | Retry on transient failures; stub fallback |
4.6.4.2 Disaster Recovery
| Component | RPO | RTO | Strategy |
|---|---|---|---|
| Source code + artifacts | 0 (git) | Minutes | Git clone from GitHub (distributed) |
| Snowflake data | Per Snowflake Time Travel (up to 90 days) | Hours | Snowflake native DR (failover) |
| OPA policies | 0 (git) | Minutes | Re-deploy from git (K8s manifests) |
| Vault secrets | Per Vault snapshot schedule | Hours | Vault snapshot restore |
| Pipeline state | N/A (stateless) | Immediate | Re-run pipeline (idempotent) |
Git as artifact store provides inherent DR. All generated artifacts are committed to git. The repository is the source of truth. Any lost deployment can be recreated by re-running the pipeline against the committed model.
4.6.4.3 Security
| Concern | Control |
|---|---|
| Authentication | OAuth 2.0 via PingFederate (all systems). AWS IAM for Bedrock. |
| Authorization | Snowflake RBAC (role per environment). K8s RBAC (namespace scoped). Vault policies (path scoped). |
| Secrets management | HashiCorp Vault (OIDC + AppRole). No secrets in code or CI variables. |
| Data classification | Collibra-sourced PII flags → column-level masking in Snowflake. |
| Audit | Append-only audit table in Snowflake. Git history for all artifact changes. |
| Network | Roche VPN for internal systems. HTTPS for all external calls. No plain HTTP. |
| Supply chain | Cargo.lock pinned. GitHub Dependabot for CVE alerts. |
4.6.4.4 Maintainability
| Aspect | Approach |
|---|---|
| Observability | Structured tracing (stderr) with level control (--verbose, --quiet, RUST_LOG). Machine-readable result envelopes (--json) for aggregation. |
| Debugging | --dry-run mode for safe testing. --verbose for full trace output. Workspace retention (--keep-workspace) for post-mortem inspection. |
| Code quality | Cargo clippy (deny warnings). cargo test --workspace in CI. Integration test feature flag (integration). |
| Documentation | Auto-generated from artifacts by rdt-model-docs. Cannot drift from implementation. ADRs for architectural decisions. |
| Dependency management | Cargo workspace with shared dependency versions. Dependabot alerts. Minimal external dependencies for pure-rendering modules. |
| Template evolution | Template changes propagate to all entities on next pipeline run. No per-entity customisation — consistency enforced by design. |
Appendix A: Technology Stack
| Category | Technology | Version | Purpose |
|---|---|---|---|
| Language | Rust | Edition 2021 | CLI implementation (18 binaries + 1 library) |
| Build | Cargo | Workspace | Multi-crate build, dependency management |
| CLI framework | clap | 4.x | Command-line argument parsing, subcommands |
| Template engine | Tera | 1.x | Embedded template rendering (SQL, YAML, Rego, K8s manifests) |
| JSON parsing | simd-json | 0.14 | SIMD-accelerated JSON parsing (via JsonHandler) |
| JSON Schema | jsonschema | 0.18 | Artifact validation (Draft 2020-12) |
| Serialization | serde + serde_json + serde_yaml | 1.x | JSON/YAML serialization/deserialization |
| HTTP client | reqwest | 0.12 | External system API calls |
| Async runtime | tokio | 1.x | Async I/O for network-bound modules |
| Async traits | async-trait | 0.1 | Async trait definitions for client traits |
| Tracing | tracing + tracing-subscriber | 0.1 | Structured logging and diagnostics |
| Date/time | chrono | 0.4 | Timestamps, date handling |
| UUID | uuid | 1.x | UUIDv7 run correlation IDs |
| Compression | flate2 | 1.x | Optional gzip for profile output |
| Regex | regex | 1.x | SQL identifier validation, pattern matching |
| Env files | dotenvy | 0.15 | .env file loading |
| Data platform | Snowflake | — | Medallion architecture (Bronze/Silver/Gold/Semantic) |
| Transformation | dbt | Core | View generation, batch DQ tests |
| Policy engine | Open Policy Agent | 0.x | Real-time DQ enforcement, access control |
| Policy language | Rego | — | Policy definitions compiled from YAML DSL |
| Container platform | Kubernetes (Rancher/CaaS) | 1.x | OPA deployment, bundle refresh jobs |
| Container registry | Artifactory | — | Docker image storage for OPA containers |
| Secret management | HashiCorp Vault | — | OIDC + AppRole auth, KV v2 secrets |
| Identity provider | PingFederate | — | OAuth 2.0 (client_credentials) for all systems |
| CI/CD | GitHub Actions | — | Validate, deploy, docs workflows |
| Documentation | Starlight (Astro) | — | Generated reference documentation site |
| LLM | Claude on AWS Bedrock | — | Metadata enrichment (term mapping, descriptions) |
| UI framework | Streamlit in Snowflake | — | CRUD and ratification web applications |
| Data contract | datacontract.com | 1.1.0 | Machine-readable schema + SLA + quality spec |
| Event bus | Solace | — | Enterprise event publishing |
| Search | Sinequa | — | Enterprise search indexing |
| API gateway | Mulesoft (Anypoint) | — | Managed API publication |
| AI query | Snowflake Cortex Analyst | — | Natural language query over Semantic views |
Appendix B: Vault Path Mapping
Common Paths (shared across environments)
| Path | Contents | Used by |
|---|---|---|
| secret/common/caas | Rancher token, cluster URL | rdt-model-policy (K8s deployment) |
| secret/common/artifactory | Docker registry credentials | CI/CD (image push) |
| secret/common/github | GitHub App credentials | CI/CD workflows |
Per-Environment Paths
| Path Pattern | Contents | Used by |
|---|---|---|
| secret/{env}/ci/snowflake | Snowflake OAuth client_id/secret, account, warehouse, role | rdt-model-store, all Snowflake ops |
| secret/{env}/ci/collibra | Collibra API client_id/secret, bridge key | rdt-model-govern, rdt-model-register |
| secret/{env}/ci/rtis | RTiS API credentials (Basic Auth or OAuth) | rdt-model-pull |
| secret/{env}/ci/gupri | GUPRI API credentials | rdt-model-gupri |
| secret/{env}/ci/mulesoft | Anypoint Platform credentials | rdt-model-api |
| secret/{env}/ci/solace | Solace connection credentials | rdt-model-event |
| secret/{env}/ci/servicenow | ServiceNow API credentials | rdt-model-cidb |
| secret/{env}/ci/sinequa | Sinequa API credentials | rdt-model-search |
| secret/{env}/ci/bedrock | AWS IAM credentials for Bedrock | rdt-model-infer |
| secret/{env}/ci/postgres | Aurora PostgreSQL credentials | rdt-model-profile |
| secret/{env}/ci/mrhub | MRHub API credentials | rdt-model-policy |
Where {env} is one of dev, test, prod.
Appendix C: Access Task Status Matrix
Access tasks track the provisioning of credentials and network paths to external systems. Each task is a GitHub Issue.
| ID | System | Description | Issue | Status |
|---|---|---|---|---|
| A01 | RTiS | REST API credentials + network path | #15 | Pending |
| A02 | GUPRI | REST API credentials + network path | #16 | Pending |
| A03 | MRHub | REST API credentials for G2 lookups | #24 | Not started |
| A04 | MRHub / Solace | Solace event subscription + publish credentials | #24 | Not started |
| A05 | Snowflake | Service account, database, schema provisioning | #23 | Partial (auth live) |
| A06 | Snowflake | Cortex Analyst feature enablement | #23 | Pending |
| A07 | Collibra | REST API credentials for governance metadata pull | #25 | Pending |
| A08 | Collibra | REST API credentials for lineage push | #25 | Pending |
| A09 | Mulesoft | Anypoint Platform API credentials | #26 | Pending |
| A10 | GitHub Actions | Workflow configuration + runner access | — | Done |
| A11 | GitHub Actions | Runner VPN access for internal systems | — | Done |
| A12 | ServiceNow | Table API credentials for CIDM | #27 | Pending |
| A13 | CaaS/K8s | Rancher access + namespace provisioning | #28 | Active |
| A14 | LeanIX | EA catalog API credentials (stretch) | #29 | Not started |
| A15 | Data Marketplace | Registry API credentials (stretch) | #30 | Not started |
| A16 | Vault | OIDC + AppRole configuration for CI | #70 | Done |
| A17 | Sinequa | Search API credentials + push mechanism | #80 | Pending |
| A18 | Aurora PostgreSQL | Database connection credentials for profiling | TBD | Not started |
| A19 | Snowflake WAM | OAuth token exchange configuration | TBD | Done |
Appendix D: ADR Cross-Reference
| ADR | Title | Status | Sections Referenced |
|---|---|---|---|
| 0001 | Project Vision | Accepted | §1, §3, §4.1, §4.2 |
| 0002 | Rust as CLI Implementation Language | Accepted | §4.1.2, Appendix A |
| 0003 | Monorepo Structure | Accepted | §4.1.2, §4.4 |
| 0004 | Virtual Medallion Architecture | Accepted | §4.3.2, §4.5.5 |
| 0005 | Rule Engine — MODEL DSL to OPA on K8s | Accepted | §4.3.2, §4.4.8 |
| 0005b | OPA as MODEL Unified Policy Engine | Accepted | §4.3.2, §4.4.8 |
| 0006 | Multi-Binary Cargo Workspace | Superseded by 0011 | — |
| 0007 | Data Product Lifecycle | Proposed | §4.2.1, §4.2.2 |
| 0008 | CLI Module Development Standards | Proposed | §4.4, §4.4.21 |
| 0009 | Module I/O Contracts | Accepted | §4.4, Pipeline Overview |
| 0010 | Environment Strategy | Proposed | §4.5.3 |
| 0011 | Pipeline Restructure (19-module / 6-phase) | Accepted | §4.2.2, §4.4 |
Appendix E: Module Implementation Status
| Module | Phase | Async | Templates | Schemas | Implementation | Client Trait |
|---|---|---|---|---|---|---|
| rdt-model-pull | 1 | Yes | 0 | 1 | Stub (HTTP client ready) | RTisClient |
| rdt-model-profile | 1 | Yes | 0 | 2 | Stub | DatabaseProbe |
| rdt-model-govern | 2 | Yes | 0 | 0 | Stub (HTTP client ready) | CollibraClient |
| rdt-model-infer | 2 | Yes | 0 | 1 | Planned | LlmClient |
| rdt-model-compile | 3 | No | 0 | 0 | Planned (orchestrator) | None |
| rdt-model-validate | 3 | No | 0 | 0 | Planned | None |
| rdt-model-store | 4 | No | 8 | 0 | Stub (templates ready) | SnowflakeClient |
| rdt-model-policy | 4 | Mixed | 7 | 3 | Stub (templates ready) | (pending) |
| rdt-model-api | 4 | Yes | 0 | 0 | Planned | MulesoftClient |
| rdt-model-mcp | 4 | Yes | 0 | 0 | Planned | McpRegistryClient |
| rdt-model-sdk | 4 | No | 0 | 0 | Planned | None |
| rdt-model-contract | 4 | No | 1 | 1 | Production | None |
| rdt-model-register | 5 | Yes | 0 | 0 | Planned | CollibraClient, HorizonClient |
| rdt-model-gupri | 5 | Yes | 0 | 1 | Production | GupriClient |
| rdt-model-search | 5 | Yes | 0 | 0 | Planned | SinequaClient |
| rdt-model-docs | 6 | No | 0 | 0 | Planned | None |
| rdt-model-cidb | 6 | Yes | 0 | 0 | Planned | ServicenowClient |
| rdt-model-event | 6 | Yes | 0 | 1 | Production | SolaceClient |
Production = fully executable with fixtures (not stubbed logic). Stub = framework with templates/schemas present; execution delegates to stub clients. Planned = CLI skeleton defined; execution not yet implemented.
Appendix F: Diagram Index
| Diagram | Location | Used In |
|---|---|---|
| Platform Flow (6 phases, ASCII) | Inline in §4.2.2 | §4.2, §4.4 |
| CLI Module Architecture (SVG) | docs/src/assets/diagrams/model-cli.svg | §4.4 |
| Medallion Architecture (SVG) | docs/src/assets/diagrams/model-medallion.svg | §4.3, §4.5 |
| System Context (ASCII) | Inline in §4.3.3 | §4.3 |
| Physical Infrastructure (ASCII) | Inline in §4.5.1 | §4.5 |
| Network Topology (ASCII) | Inline in §4.5.2 | §4.5 |
| Data Storage Layout (ASCII) | Inline in §4.5.5 | §4.5 |
| DQ Gate Flow (ASCII) | Inline in §4.3.2 | §4.3 |
| Data Product Lifecycle (ASCII) | Inline in §4.2.1 | §4.2 |
| Business Process Flow (ASCII) | Inline in §4.2.4 | §4.2 |
| Conceptual Data Model (ASCII) | Inline in §4.3.1 | §4.3 |