Semantic Layer on Iceberg
Philosophy
Traditional data platforms bundle three concerns into one system: storage, schema enforcement, and relationship metadata. Apache Iceberg decouples the first: storage becomes an open table format on object storage, readable by any engine. But Iceberg deliberately has no concept of relationships between tables: no enforced primary keys, no foreign keys, no referential constraints.
This creates a question: where do relationships live?
The roche-data answer: a semantic graph maintained in RTiS, compiled into every consumption layer by the pipeline. Iceberg stores the data. The semantic graph defines how entities connect. The pipeline combines both into consumable views, AI definitions, data contracts, and enterprise catalog entries.
The principle: flat storage, rich meaning. Every Iceberg table is a simple, engine-agnostic append log. All intelligence — types, labels, relationships, quality rules — lives above it as compiled metadata.
Why separate storage from meaning?
| Concern | Traditional approach | Semantic graph approach |
|---|---|---|
| Data quality | Reject bad rows at write time | Filter invalid rows at read time — nothing is lost, issues are visible |
| Relationship discovery | Locked inside one database engine | Available across all systems — Snowflake, Cortex, Collibra, Horizon |
| Schema changes | Risky migrations with downtime | Backwards-compatible evolution — old and new readers coexist |
| Multi-engine access | Vendor lock-in | Any engine reads the same Iceberg data |
| AI readiness | Requires manual configuration per tool | Semantic view provides relationships to Cortex Analyst automatically |
Quality enforcement at read time means Bronze never rejects data — it lands everything. Silver and Gold views apply the relationship graph to filter, enrich, and validate. This makes the system resilient: raw data is always preserved, and quality rules can be refined without re-ingesting.
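A minimal sketch of read-time enforcement, assuming in-memory rows in place of Iceberg tables. The record shape, the `split_by_validity` helper, and the invalid `XX-999` identifier are hypothetical and serve only to illustrate that Bronze keeps every row while the read path separates valid rows from visible issues.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QualityRecord:
    record_id: str
    site_id: str


# Bronze: every row lands, including one whose SITE_ID matches no known site.
bronze = [
    QualityRecord("Q1", "CH-004"),
    QualityRecord("Q2", "XX-999"),  # hypothetical invalid reference
]

# Hypothetical master data used for the validity check.
known_site_ids = {"CH-004"}


def split_by_validity(rows, valid_ids):
    """Partition rows at read time; the Bronze list itself is never modified."""
    valid = [r for r in rows if r.site_id in valid_ids]
    invalid = [r for r in rows if r.site_id not in valid_ids]
    return valid, invalid


silver_rows, quality_issues = split_by_validity(bronze, known_site_ids)
```

Because the rule is applied when reading, changing `known_site_ids` or the predicate only requires re-running the view logic, not re-ingesting Bronze.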
Metadata producers and consumers
The semantic graph flows through the pipeline from source systems to consumption targets. Each step either produces or consumes relationship metadata.
Producers — where semantic metadata originates
| Source | What it provides | Pipeline step |
|---|---|---|
| RTiS GraphDB | Entity structure, field definitions, terminology bindings, inter-entity relationships | Pull |
| Upstream databases | Physical relationship constraints, column metadata, table statistics | Profile |
| Collibra | Stewardship rules, ownership, SLA, classification, sensitivity flags | Govern |
| MRHub | Master data for validity checks — confirms that referenced entities exist | Policy |
| Developers | Relationship definitions authored in code, validated by CI | Git |
Consumers — where semantic metadata is delivered
| Consumer | What it receives | Purpose |
|---|---|---|
| Snowflake Bronze (Iceberg) | Flat tables with catalog-level relationship metadata | Storage and catalog visibility |
| Silver Views | Relationship-driven enrichment — IDs resolved to human-readable labels | Curated, analyst-ready data |
| Gold Views | Business metrics enriched with linked entity attributes | KPI calculation and reporting |
| Semantic View | Explicit relationship declarations for Cortex Analyst | AI-driven natural language queries |
| Data Contract | Machine-readable relationship inventory | Consumer discovery and integration |
| Snowflake Horizon | Cross-account lineage and governance metadata | Enterprise-wide data discovery |
| Collibra | Entity-to-entity lineage edges | Governance graph and impact analysis |
How meaning builds through the medallion layers
Each layer adds semantic richness on top of the flat Iceberg storage.
| Layer | What it contains | Role of relationships |
|---|---|---|
| Bronze | Raw identifiers and codes exactly as received from source | Relationships stored as catalog metadata — not applied to data |
| Silver | IDs plus human-readable labels from linked entities | Relationships drive label resolution: the SITE_ID value “CH-004” gains the label “Basel Packaging Plant” |
| Gold | Business metrics with full context from related entities | Relationships determine which linked attributes are surfaced for KPIs |
| Semantic | AI-queryable model with declared cross-entity connections | Relationships enable Cortex Analyst to answer multi-entity questions without manual configuration |
Example: A product quality record in Bronze contains only SITE_ID = "CH-004". The Silver view uses the relationship graph to join to the Sites entity and add SITE_NAME = "Basel Packaging Plant". The Gold view calculates quality-per-site metrics using this enrichment. The Semantic view tells Cortex Analyst that quality records connect to sites via SITE_ID, enabling natural language questions like “What is the defect rate at Basel this quarter?”
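The Bronze-to-Silver step of this example can be sketched as a simple lookup join. This is an illustration over in-memory dictionaries, not the pipeline's generated view; the `RECORD_ID` column and the `enrich_with_site` helper are hypothetical.

```python
# Bronze row: only the raw identifier, exactly as received.
bronze_quality = [{"RECORD_ID": "Q1", "SITE_ID": "CH-004"}]

# Sites entity, keyed by SITE_ID.
sites = {"CH-004": {"SITE_NAME": "Basel Packaging Plant"}}


def enrich_with_site(rows, sites_by_id):
    """Return new rows with SITE_NAME resolved; unknown sites yield None."""
    enriched = []
    for row in rows:
        site = sites_by_id.get(row["SITE_ID"], {})
        enriched.append({**row, "SITE_NAME": site.get("SITE_NAME")})
    return enriched


silver_quality = enrich_with_site(bronze_quality, sites)
```

The function builds new rows rather than mutating its input, mirroring the idea that Silver is a view over Bronze rather than a rewrite of it.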
Relationship model
Each relationship connects a property in one entity to a property in another. The definition includes cardinality (how many records connect) and a semantic type (what the connection means).
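One way to picture such a definition is as a small immutable record. The sketch below is an assumption about shape only: the field names, the `Cardinality` values, and the `ProductQuality` entity name are hypothetical, while the semantic type names are the five used in this document.

```python
from dataclasses import dataclass
from enum import Enum


class Cardinality(Enum):
    # Hypothetical notation; the source does not specify how cardinality is encoded.
    ONE_TO_ONE = "1:1"
    MANY_TO_ONE = "N:1"
    ONE_TO_MANY = "1:N"
    MANY_TO_MANY = "N:M"


class SemanticType(Enum):
    REFERENCES = "references"
    CLASSIFIES = "classifies"
    CONTAINS = "contains"
    DERIVES_FROM = "derives-from"
    IS_A = "is-a"


@dataclass(frozen=True)
class Relationship:
    source_entity: str
    source_property: str
    target_entity: str
    target_property: str
    cardinality: Cardinality
    semantic_type: SemanticType


# Many quality records point to one site via SITE_ID.
quality_to_site = Relationship(
    source_entity="ProductQuality",  # hypothetical entity name
    source_property="SITE_ID",
    target_entity="Sites",
    target_property="SITE_ID",
    cardinality=Cardinality.MANY_TO_ONE,
    semantic_type=SemanticType.REFERENCES,
)
```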
Semantic types
| Type | Meaning | Effect on data layers |
|---|---|---|
| references | This property points to a record in another entity | Silver/Gold views resolve the linked record’s attributes |
| classifies | This property maps to a controlled vocabulary or taxonomy | Views resolve the code to its display label |
| contains | One entity owns instances of another (parent-child) | Gold views can nest or aggregate child records |
| derives-from | This entity was produced from another entity’s data | Lineage tracking only; no data enrichment |
| is-a | This entity is a specialisation of another (ontology hierarchy) | Inheritance in the semantic model |
The semantic type drives how the pipeline compiles the relationship. A “references” relationship generates an enrichment join. A “derives-from” relationship generates a lineage record in Collibra but does not affect the data views.
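The dispatch described above might look roughly like the following. This is a sketch, not the pipeline's compiler: the `compile_relationship` function, the dictionary keys, and the artifact names are hypothetical, and only the two types whose outputs are stated explicitly here are handled.

```python
def compile_relationship(rel: dict) -> dict:
    """Map one relationship to the kind of artifact it would produce."""
    semantic_type = rel["semantic_type"]
    if semantic_type == "references":
        # Enrichment join: resolve attributes of the linked entity in the views.
        return {
            "artifact": "enrichment_join",
            "join_on": rel["source_property"],
            "target_entity": rel["target_entity"],
        }
    if semantic_type == "derives-from":
        # Lineage only: recorded for governance, no effect on data views.
        return {
            "artifact": "lineage_record",
            "upstream_entity": rel["target_entity"],
        }
    raise NotImplementedError(f"no compile rule sketched for {semantic_type!r}")


join_plan = compile_relationship({
    "semantic_type": "references",
    "source_entity": "ProductQuality",  # hypothetical entity name
    "source_property": "SITE_ID",
    "target_entity": "Sites",
})
lineage_plan = compile_relationship({
    "semantic_type": "derives-from",
    "source_entity": "QualityPerSite",  # hypothetical entity name
    "source_property": "SITE_ID",
    "target_entity": "ProductQuality",
})
```

Keeping the two branches separate is what prevents a lineage-only relationship from producing a join it does not need.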
Authoring workflow
Relationships are authored in Git and pushed to RTiS via CI. This gives teams a familiar review process while building toward RTiS as the long-term source of truth.
Phase 1 — Developer authoring (now)
Teams define relationships in a manifest file committed to Git. A pull request triggers validation (do the referenced entities exist? are the types valid?). On merge, CI pushes the definitions to RTiS GraphDB. The next pipeline run pulls the updated graph.
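The two validation questions above can be sketched as a small check that a CI job might run. The manifest layout, the `validate_manifest` function, and the entity names are assumptions for illustration; the actual manifest format is not specified here.

```python
VALID_SEMANTIC_TYPES = {"references", "classifies", "contains", "derives-from", "is-a"}


def validate_manifest(relationships: list[dict], known_entities: set[str]) -> list[str]:
    """Return human-readable errors; an empty list means the manifest passes."""
    errors = []
    for index, rel in enumerate(relationships):
        # Do the referenced entities exist?
        for side in ("source_entity", "target_entity"):
            entity = rel.get(side)
            if entity not in known_entities:
                errors.append(f"relationship {index}: unknown {side} {entity!r}")
        # Is the semantic type valid?
        semantic_type = rel.get("semantic_type")
        if semantic_type not in VALID_SEMANTIC_TYPES:
            errors.append(f"relationship {index}: invalid semantic_type {semantic_type!r}")
    return errors


# Hypothetical manifest: the second entry has two deliberate mistakes.
manifest = [
    {"source_entity": "ProductQuality", "target_entity": "Sites", "semantic_type": "references"},
    {"source_entity": "ProductQuality", "target_entity": "Plants", "semantic_type": "points-to"},
]
errors = validate_manifest(manifest, known_entities={"ProductQuality", "Sites"})
```

Collecting all errors instead of failing on the first lets a reviewer fix a pull request in one pass.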
Phase 2 — Automated discovery (mid-term)
The profile module connects to upstream databases, extracts existing relationship constraints, and proposes new relationships via pull request. Developers review, adjust, and merge: the same workflow as Phase 1, but seeded automatically rather than manually.
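As a rough illustration of the seeding step, the sketch below turns extracted foreign-key constraints into proposed manifest entries. The input shape, the `propose_relationships` function, the table names, and the choice of “references” as the default type are all assumptions; a developer would still adjust the proposals during review.

```python
def propose_relationships(foreign_keys: list[dict]) -> list[dict]:
    """Convert foreign-key constraints into relationship proposals for review."""
    proposals = []
    for fk in foreign_keys:
        proposals.append({
            "source_entity": fk["table"],
            "source_property": fk["column"],
            "target_entity": fk["referenced_table"],
            "target_property": fk["referenced_column"],
            "semantic_type": "references",  # assumed default, to be reviewed
            "status": "proposed",
        })
    return proposals


# Hypothetical constraint as it might be read from an upstream database.
extracted = [{
    "table": "PRODUCT_QUALITY",
    "column": "SITE_ID",
    "referenced_table": "SITES",
    "referenced_column": "SITE_ID",
}]
proposals = propose_relationships(extracted)
```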
Phase 3 — Steward UI (long-term)
RTiS provides a visual interface for data stewards to define and manage relationships directly. The pipeline pulls from RTiS. Git becomes a frozen snapshot for reproducible builds, not the primary authoring surface.
Design decisions
| Decision | Rationale |
|---|---|
| Iceberg for Bronze | Open format readable by any engine. Schema evolution without downtime. No vendor lock-in. |
| RTiS as relationship source of truth | GraphDB handles millions of entity connections. One query returns the entire graph. |
| Git as initial authoring surface | Familiar workflow for teams. Auditable history. CI-validated before any change reaches production. |
| Snowflake catalog metadata | Horizon and Cortex Analyst read relationship metadata natively. No additional tooling needed for discovery. |
| Quality enforcement at read time | Bronze never loses data. Quality rules can evolve without re-ingestion. Issues are visible, not hidden by rejection. |
| Semantic types on relationships | Different relationship types produce different outputs. Distinguishing “references” from “derives-from” prevents unnecessary joins and keeps lineage clean. |
| Domain-level relationship manifest | A single view of all entity connections avoids fragmented ownership and makes the graph navigable. |