Semantic Layer on Iceberg

Philosophy

Traditional data platforms bundle three concerns into one system: storage, schema enforcement, and relationship metadata. Apache Iceberg decouples the first: storage becomes an open table format on object storage, readable by any engine. But Iceberg deliberately has no concept of relationships between tables. It has no foreign keys and enforces no primary-key or other integrity constraints.

This creates a question: where do relationships live?

The roche-data answer: a semantic graph maintained in RTiS, compiled into every consumption layer by the pipeline. Iceberg stores the data. The semantic graph defines how entities connect. The pipeline combines both into consumable views, AI definitions, data contracts, and enterprise catalog entries.

Figure: Semantic layer philosophy. RTiS and Iceberg converge through the pipeline into Silver/Gold views, Semantic Views, and Data Contracts.

The principle: flat storage, rich meaning. Every Iceberg table is treated as a simple, engine-agnostic append log. All intelligence (semantic types, labels, relationships, quality rules) lives above it as compiled metadata.

Why separate storage from meaning?

Concern	Traditional approach	Semantic graph approach
Data quality	Reject bad rows at write time	Filter invalid rows at read time — nothing is lost, issues are visible
Relationship discovery	Locked inside one database engine	Available across all systems — Snowflake, Cortex, Collibra, Horizon
Schema changes	Risky migrations with downtime	Backwards-compatible evolution — old and new readers coexist
Multi-engine access	Vendor lock-in	Any engine reads the same Iceberg data
AI readiness	Requires manual configuration per tool	Semantic view provides relationships to Cortex Analyst automatically

Quality enforcement at read time means Bronze never rejects data — it lands everything. Silver and Gold views apply the relationship graph to filter, enrich, and validate. This makes the system resilient: raw data is always preserved, and quality rules can be refined without re-ingesting.
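
Read-time enforcement can be illustrated with a minimal sketch. It assumes in-memory rows and a small lookup standing in for the Sites entity; the names `silver_view` and `quality_issues` and the sample data are hypothetical, not part of the pipeline.

```python
# Hypothetical Bronze rows: everything lands, including a row whose SITE_ID
# does not match any known site.
bronze_rows = [
    {"RECORD_ID": 1, "SITE_ID": "CH-004"},
    {"RECORD_ID": 2, "SITE_ID": "XX-999"},  # unknown site, still stored
]

# Stand-in for the linked Sites entity (assumed data).
known_sites = {"CH-004": "Basel Packaging Plant"}


def silver_view(rows, sites):
    """Return only rows whose SITE_ID resolves, enriched with SITE_NAME."""
    return [
        {**row, "SITE_NAME": sites[row["SITE_ID"]]}
        for row in rows
        if row["SITE_ID"] in sites
    ]


def quality_issues(rows, sites):
    """Return the rows the Silver view filters out, so issues stay visible."""
    return [row for row in rows if row["SITE_ID"] not in sites]


silver = silver_view(bronze_rows, known_sites)
issues = quality_issues(bronze_rows, known_sites)
```

Because `bronze_rows` is never modified, refining the rule (for example, adding a site to `known_sites`) changes what the view returns without re-ingesting anything.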

Metadata producers and consumers

The semantic graph flows through the pipeline from source systems to consumption targets. Each step either produces or consumes relationship metadata.

Figure: Metadata flows from producers (RTiS, databases, Collibra, developers) through pipeline artifacts to consumers (Snowflake layers, Cortex, Horizon, Data Contract).

Producers — where semantic metadata originates

Source	What it provides	Pipeline step
RTiS GraphDB	Entity structure, field definitions, terminology bindings, inter-entity relationships	Pull
Upstream databases	Physical relationship constraints, column metadata, table statistics	Profile
Collibra	Stewardship rules, ownership, SLA, classification, sensitivity flags	Govern
MRHub	Master data for validity checks — confirms that referenced entities exist	Policy
Developers	Relationship definitions authored in code, validated by CI	Git

Consumers — where semantic metadata is delivered

Consumer	What it receives	Purpose
Snowflake Bronze (Iceberg)	Flat tables with catalog-level relationship metadata	Storage and catalog visibility
Silver Views	Relationship-driven enrichment — IDs resolved to human-readable labels	Curated, analyst-ready data
Gold Views	Business metrics enriched with linked entity attributes	KPI calculation and reporting
Semantic View	Explicit relationship declarations for Cortex Analyst	AI-driven natural language queries
Data Contract	Machine-readable relationship inventory	Consumer discovery and integration
Snowflake Horizon	Cross-account lineage and governance metadata	Enterprise-wide data discovery
Collibra	Entity-to-entity lineage edges	Governance graph and impact analysis

How meaning builds through the medallion layers

Each layer adds semantic richness on top of the flat Iceberg storage.

Figure: Semantic enrichment through medallion layers. Bronze stores raw IDs, Silver resolves labels, Gold calculates metrics, Semantic enables AI queries.

Layer	What it contains	Role of relationships
Bronze	Raw identifiers and codes exactly as received from source	Relationships stored as catalog metadata — not applied to data
Silver	IDs plus human-readable labels from linked entities	Relationships drive label resolution: SITE_ID = "CH-004" resolves to SITE_NAME = "Basel Packaging Plant"
Gold	Business metrics with full context from related entities	Relationships determine which linked attributes are surfaced for KPIs
Semantic	AI-queryable model with declared cross-entity connections	Relationships enable Cortex Analyst to answer multi-entity questions without manual configuration

Example: A product quality record in Bronze contains only SITE_ID = "CH-004". The Silver view uses the relationship graph to join to the Sites entity and add SITE_NAME = "Basel Packaging Plant". The Gold view calculates quality-per-site metrics using this enrichment. The Semantic view tells Cortex Analyst that quality records connect to sites via SITE_ID, enabling natural language questions like “What is the defect rate at Basel this quarter?”
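
The Bronze-to-Gold part of this example can be sketched with an in-memory SQLite database standing in for Snowflake. The table and view names, the `defects` and `units` columns, and the sample rows are assumptions made for illustration; only `SITE_ID`, `SITE_NAME`, and the "CH-004" value come from the example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Bronze: flat tables holding raw identifiers only (assumed columns).
    CREATE TABLE bronze_product_quality (
        record_id INTEGER, site_id TEXT, defects INTEGER, units INTEGER
    );
    CREATE TABLE bronze_sites (site_id TEXT, site_name TEXT);

    INSERT INTO bronze_product_quality VALUES
        (1, 'CH-004', 3, 100),
        (2, 'CH-004', 1, 100);
    INSERT INTO bronze_sites VALUES ('CH-004', 'Basel Packaging Plant');

    -- Silver: the relationship on site_id resolves the ID to a label.
    CREATE VIEW silver_product_quality AS
    SELECT q.record_id, q.site_id, s.site_name, q.defects, q.units
    FROM bronze_product_quality AS q
    JOIN bronze_sites AS s ON s.site_id = q.site_id;

    -- Gold: a per-site metric built on the enriched Silver view.
    CREATE VIEW gold_defect_rate_by_site AS
    SELECT site_name, 1.0 * SUM(defects) / SUM(units) AS defect_rate
    FROM silver_product_quality
    GROUP BY site_name;
    """
)

silver_rows = conn.execute(
    "SELECT site_id, site_name FROM silver_product_quality ORDER BY record_id"
).fetchall()
gold_rows = conn.execute(
    "SELECT site_name, defect_rate FROM gold_defect_rate_by_site"
).fetchall()
```

In this sketch the join condition is written by hand. In the approach described here, the pipeline would derive it from the relationship graph.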

Relationship model

Each relationship connects a property in one entity to a property in another. The definition includes cardinality (how many records connect) and a semantic type (what the connection means).
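
One possible in-code shape for such a definition is sketched below. The class and field names and the cardinality notation are hypothetical; the semantic type values are the ones listed under Semantic types.

```python
from dataclasses import dataclass
from enum import Enum


class Cardinality(Enum):
    # Assumed notation; the source does not prescribe one.
    ONE_TO_ONE = "1:1"
    MANY_TO_ONE = "N:1"
    ONE_TO_MANY = "1:N"
    MANY_TO_MANY = "N:M"


class SemanticType(Enum):
    REFERENCES = "references"
    CLASSIFIES = "classifies"
    CONTAINS = "contains"
    DERIVES_FROM = "derives-from"
    IS_A = "is-a"


@dataclass(frozen=True)
class Relationship:
    """Connects a property in one entity to a property in another."""

    source_entity: str
    source_property: str
    target_entity: str
    target_property: str
    cardinality: Cardinality
    semantic_type: SemanticType


# Many quality records point at one site (entity names are illustrative).
quality_to_site = Relationship(
    source_entity="ProductQuality",
    source_property="SITE_ID",
    target_entity="Sites",
    target_property="SITE_ID",
    cardinality=Cardinality.MANY_TO_ONE,
    semantic_type=SemanticType.REFERENCES,
)
```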

Semantic types

Type	Meaning	Effect on data layers
`references`	This property points to a record in another entity	Silver/Gold views resolve the linked record’s attributes
`classifies`	This property maps to a controlled vocabulary or taxonomy	Views resolve the code to its display label
`contains`	One entity owns instances of another (parent-child)	Gold views can nest or aggregate child records
`derives-from`	This entity was produced from another entity’s data	Lineage tracking only — no data enrichment
`is-a`	This entity is a specialisation of another (ontology hierarchy)	Inheritance in the semantic model

The semantic type drives how the pipeline compiles the relationship. A `references` relationship generates an enrichment join. A `derives-from` relationship generates a lineage record in Collibra but does not affect the data views.
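
A minimal sketch of this dispatch, assuming the semantic type arrives as a string. The output names (`enrichment_join`, `lineage_record`, and so on) are hypothetical labels for the effects in the table above, not real pipeline artifacts.

```python
# Hypothetical output names; the mapping mirrors the semantic types table.
COMPILE_TARGETS = {
    "references": ["enrichment_join"],
    "classifies": ["label_lookup"],
    "contains": ["child_aggregation"],
    "derives-from": ["lineage_record"],
    "is-a": ["semantic_model_inheritance"],
}

# Assumption: only these outputs change what the Silver/Gold views return.
DATA_VIEW_TARGETS = {"enrichment_join", "label_lookup", "child_aggregation"}


def compile_targets(semantic_type: str) -> list[str]:
    """Return the outputs the pipeline would generate for a semantic type."""
    try:
        return COMPILE_TARGETS[semantic_type]
    except KeyError:
        raise ValueError(f"unknown semantic type: {semantic_type!r}") from None


def affects_data_views(semantic_type: str) -> bool:
    """True if any compiled output touches the Silver/Gold data views."""
    return any(t in DATA_VIEW_TARGETS for t in compile_targets(semantic_type))
```

Keeping the mapping in one table means an unknown type fails loudly at compile time instead of silently producing no output.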

Authoring workflow

Relationships are authored in Git and pushed to RTiS via CI. This gives teams a familiar review process while building toward RTiS as the long-term source of truth.

Figure: Relationship authoring lifecycle. Phase 1: developer authoring in Git. Phase 2: automated discovery from databases. Phase 3: steward UI in RTiS.

Phase 1 — Developer authoring (now)

Teams define relationships in a manifest file committed to Git. A pull request triggers validation (do the referenced entities exist? are the types valid?). On merge, CI pushes the definitions to RTiS GraphDB. The next pipeline run pulls the updated graph.
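
The two validation questions can be sketched as a CI check. The manifest shape (a list of dictionaries with `source_entity`, `target_entity`, and `type` keys) is an assumption for illustration; the actual manifest format is not specified here.

```python
VALID_TYPES = {"references", "classifies", "contains", "derives-from", "is-a"}


def validate_manifest(relationships, known_entities):
    """Return a list of error messages; an empty list means the manifest passes."""
    errors = []
    for index, rel in enumerate(relationships):
        # Do the referenced entities exist?
        for side in ("source_entity", "target_entity"):
            if rel.get(side) not in known_entities:
                errors.append(
                    f"relationship {index}: unknown {side} {rel.get(side)!r}"
                )
        # Is the semantic type valid?
        if rel.get("type") not in VALID_TYPES:
            errors.append(
                f"relationship {index}: invalid type {rel.get('type')!r}"
            )
    return errors


# Hypothetical manifest content and entity list.
manifest = [
    {"source_entity": "ProductQuality", "target_entity": "Sites", "type": "references"},
    {"source_entity": "ProductQuality", "target_entity": "Plants", "type": "owns"},
]
errors = validate_manifest(manifest, known_entities={"ProductQuality", "Sites"})
```

A CI job built on this sketch would fail the pull request whenever `errors` is non-empty, so invalid definitions never reach RTiS GraphDB.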

Phase 2 — Automated discovery (mid-term)

The profile module connects to upstream databases, extracts existing relationship constraints, and proposes new relationships via pull request. Developers review, adjust, and merge — the same workflow as Phase 1, but seeded automatically rather than manually.
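
The extraction step can be sketched against SQLite, whose `PRAGMA foreign_key_list` exposes declared foreign keys. SQLite stands in for an upstream database here, and the function name, proposal shape, and default `references` type are assumptions; the real profile module may work differently.

```python
import sqlite3


def propose_relationships(conn, table):
    """Turn the declared foreign keys of `table` into relationship proposals."""
    proposals = []
    # Result columns: id, seq, table, from, to, on_update, on_delete, match.
    # `table` must be a trusted identifier, since PRAGMA takes no bound parameters.
    rows = conn.execute(f'PRAGMA foreign_key_list("{table}")').fetchall()
    for _id, _seq, target_table, from_col, to_col, *_rest in rows:
        proposals.append(
            {
                "source_entity": table,
                "source_property": from_col,
                "target_entity": target_table,
                "target_property": to_col,
                # Assumed default; a reviewer can change it in the pull request.
                "type": "references",
            }
        )
    return proposals


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sites (site_id TEXT PRIMARY KEY, site_name TEXT);
    CREATE TABLE product_quality (
        record_id INTEGER PRIMARY KEY,
        site_id TEXT REFERENCES sites(site_id)
    );
    """
)
proposals = propose_relationships(conn, "product_quality")
```

The proposals would then be written into the manifest on a branch, so they pass through the same review and validation as hand-authored definitions.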

Phase 3 — Steward UI (long-term)

RTiS provides a visual interface for data stewards to define and manage relationships directly. The pipeline pulls from RTiS. Git becomes a frozen snapshot for reproducible builds, not the primary authoring surface.

Design decisions

Decision	Rationale
Iceberg for Bronze	Open format readable by any engine. Schema evolution without downtime. No vendor lock-in.
RTiS as relationship source of truth	GraphDB handles millions of entity connections. One query returns the entire graph.
Git as initial authoring surface	Familiar workflow for teams. Auditable history. CI-validated before any change reaches production.
Snowflake catalog metadata	Horizon and Cortex Analyst read relationship metadata natively. No additional tooling needed for discovery.
Quality enforcement at read time	Bronze never loses data. Quality rules can evolve without re-ingestion. Issues are visible, not hidden by rejection.
Semantic types on relationships	Different relationship types produce different outputs. Distinguishing `references` from `derives-from` prevents unnecessary joins and keeps lineage clean.
Domain-level relationship manifest	A single view of all entity connections avoids fragmented ownership and makes the graph navigable.