Skip to content

Semantic Layer on Iceberg

Traditional data platforms bundle three concerns into one system: storage, schema enforcement, and relationship metadata. Apache Iceberg decouples the first — storage becomes an open table format on object storage, readable by any engine. But Iceberg deliberately has no concept of relationships. No primary keys, no foreign keys, no constraints.

This creates a question: where do relationships live?

The roche-data answer: a semantic graph maintained in RTiS, compiled into every consumption layer by the pipeline. Iceberg stores the data. The semantic graph defines how entities connect. The pipeline combines both into consumable views, AI definitions, data contracts, and enterprise catalog entries.

Semantic layer philosophy — RTiS and Iceberg converge through the pipeline into Silver/Gold views, Semantic Views, and Data Contracts

The principle: flat storage, rich meaning. Every Iceberg table is a simple, engine-agnostic append log. All intelligence — types, labels, relationships, quality rules — lives above it as compiled metadata.

ConcernTraditional approachSemantic graph approach
Data qualityReject bad rows at write timeFilter invalid rows at read time — nothing is lost, issues are visible
Relationship discoveryLocked inside one database engineAvailable across all systems — Snowflake, Cortex, Collibra, Horizon
Schema changesRisky migrations with downtimeBackwards-compatible evolution — old and new readers coexist
Multi-engine accessVendor lock-inAny engine reads the same Iceberg data
AI readinessRequires manual configuration per toolSemantic view provides relationships to Cortex Analyst automatically

Quality enforcement at read time means Bronze never rejects data — it lands everything. Silver and Gold views apply the relationship graph to filter, enrich, and validate. This makes the system resilient: raw data is always preserved, and quality rules can be refined without re-ingesting.

The semantic graph flows through the pipeline from source systems to consumption targets. Each step either produces or consumes relationship metadata.

Metadata flows from producers (RTiS, databases, Collibra, developers) through pipeline artifacts to consumers (Snowflake layers, Cortex, Horizon, Data Contract)

Producers — where semantic metadata originates

Section titled “Producers — where semantic metadata originates”
SourceWhat it providesPipeline step
RTiS GraphDBEntity structure, field definitions, terminology bindings, inter-entity relationshipsPull
Upstream databasesPhysical relationship constraints, column metadata, table statisticsProfile
CollibraStewardship rules, ownership, SLA, classification, sensitivity flagsGovern
MRHubMaster data for validity checks — confirms that referenced entities existPolicy
DevelopersRelationship definitions authored in code, validated by CIGit

Consumers — where semantic metadata is delivered

Section titled “Consumers — where semantic metadata is delivered”
ConsumerWhat it receivesPurpose
Snowflake Bronze (Iceberg)Flat tables with catalog-level relationship metadataStorage and catalog visibility
Silver ViewsRelationship-driven enrichment — IDs resolved to human-readable labelsCurated, analyst-ready data
Gold ViewsBusiness metrics enriched with linked entity attributesKPI calculation and reporting
Semantic ViewExplicit relationship declarations for Cortex AnalystAI-driven natural language queries
Data ContractMachine-readable relationship inventoryConsumer discovery and integration
Snowflake HorizonCross-account lineage and governance metadataEnterprise-wide data discovery
CollibraEntity-to-entity lineage edgesGovernance graph and impact analysis

How meaning builds through the medallion layers

Section titled “How meaning builds through the medallion layers”

Each layer adds semantic richness on top of the flat Iceberg storage.

Semantic enrichment through medallion layers — Bronze stores raw IDs, Silver resolves labels, Gold calculates metrics, Semantic enables AI queries

LayerWhat it containsRole of relationships
BronzeRaw identifiers and codes exactly as received from sourceRelationships stored as catalog metadata — not applied to data
SilverIDs plus human-readable labels from linked entitiesRelationships drive label resolution — “SITE_ID” becomes “Basel Packaging Plant”
GoldBusiness metrics with full context from related entitiesRelationships determine which linked attributes are surfaced for KPIs
SemanticAI-queryable model with declared cross-entity connectionsRelationships enable Cortex Analyst to answer multi-entity questions without manual configuration

Example: A product quality record in Bronze contains only SITE_ID = "CH-004". The Silver view uses the relationship graph to join to the Sites entity and add SITE_NAME = "Basel Packaging Plant". The Gold view calculates quality-per-site metrics using this enrichment. The Semantic view tells Cortex Analyst that quality records connect to sites via SITE_ID, enabling natural language questions like “What is the defect rate at Basel this quarter?”

Each relationship connects a property in one entity to a property in another. The definition includes cardinality (how many records connect) and a semantic type (what the connection means).

TypeMeaningEffect on data layers
referencesThis property points to a record in another entitySilver/Gold views resolve the linked record’s attributes
classifiesThis property maps to a controlled vocabulary or taxonomyViews resolve the code to its display label
containsOne entity owns instances of another (parent-child)Gold views can nest or aggregate child records
derives-fromThis entity was produced from another entity’s dataLineage tracking only — no data enrichment
is-aThis entity is a specialisation of another (ontology hierarchy)Inheritance in the semantic model

The semantic type drives how the pipeline compiles the relationship. A references relationship generates an enrichment join. A derives-from relationship generates a lineage record in Collibra but does not affect the data views.

Relationships are authored in Git and pushed to RTiS via CI. This gives teams a familiar review process while building toward RTiS as the long-term source of truth.

Relationship authoring lifecycle — Phase 1: developer authoring in Git, Phase 2: automated discovery from databases, Phase 3: steward UI in RTiS

Teams define relationships in a manifest file committed to Git. A pull request triggers validation (do the referenced entities exist? are the types valid?). On merge, CI pushes the definitions to RTiS GraphDB. The next pipeline run pulls the updated graph.

Phase 2 — Automated discovery (mid-term)

Section titled “Phase 2 — Automated discovery (mid-term)”

The profile module connects to upstream databases, extracts existing relationship constraints, and proposes new relationships via pull request. Developers review, adjust, and merge — the same workflow as Phase 1, but seeded automatically rather than manually.

RTiS provides a visual interface for data stewards to define and manage relationships directly. The pipeline pulls from RTiS. Git becomes a frozen snapshot for reproducible builds, not the primary authoring surface.

DecisionRationale
Iceberg for BronzeOpen format readable by any engine. Schema evolution without downtime. No vendor lock-in.
RTiS as relationship source of truthGraphDB handles millions of entity connections. One query returns the entire graph.
Git as initial authoring surfaceFamiliar workflow for teams. Auditable history. CI-validated before any change reaches production.
Snowflake catalog metadataHorizon and Cortex Analyst read relationship metadata natively. No additional tooling needed for discovery.
Quality enforcement at read timeBronze never loses data. Quality rules can evolve without re-ingestion. Issues are visible, not hidden by rejection.
Semantic types on relationshipsDifferent relationship types produce different outputs. Distinguishing “references” from “derives-from” prevents unnecessary joins and keeps lineage clean.
Domain-level relationship manifestA single view of all entity connections avoids fragmented ownership and makes the graph navigable.