
Running Multi-Entity Pipelines

This guide covers the full lifecycle of running the data product pipeline for one or many entities. It explains entity configuration, workspace isolation, the orchestration workflow, artifact promotion, CI integration, failure handling, and debugging.

For module-by-module reference, see the Modules section. For the underlying architecture, see Pipeline Overview.

Run a single entity locally:
export RDT_TARGET=dev
# Dry-run — see what would happen, no side-effects
cargo run -p rdt-model-compile -- --target dev --entity waste-tracking --dry-run run
# Real run — spawns all 19 modules across 6 phases
cargo run -p rdt-model-compile -- --target dev --entity waste-tracking run

Or use the wrapper script:
# Create an isolated workspace, run, promote on success
./scripts/pipeline.sh waste-tracking --target dev

Push to main or trigger manually:

gh workflow run entity-pipeline.yml \
-f target=dev \
-f entities="waste-tracking,organization-site"

Each entity runs as a separate GitHub Actions matrix job. See CI integration below.

The entity list lives in roche-data.toml under [entities]. This is the single source of truth for which entities participate in batch pipeline runs.

[entities]
list = ["waste-tracking", "organization-site"]

Each entity must have:

  • An entry in [rtis.entities.<name>] with its RTiS class and terminology IDs (or placeholder comments if pending)
  • A models/<name>/ directory (created by the first rdt-model-pull run)

To add a new entity, add it to entities.list and create the RTiS mapping:

[entities]
list = ["waste-tracking", "organization-site", "site-energy"]
[rtis.entities.site-energy]
rtis_class_id = "ROX..."
rtis_terminology_id = "ROX..."

The pipeline is organised into 6 sequential phases. Modules within each phase run in parallel.

Pipeline phases — 6 sequential phases with parallel modules

The orchestrator (rdt-model-compile run) spawns each module as a subprocess with --json, captures the result envelope from stdout, and aggregates into a PipelineResult.
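A minimal sketch of that spawn-and-capture step, assuming serde_json is available (the real orchestrator additionally handles phase ordering, parallelism within a phase, and aggregation into PipelineResult):

use std::path::Path;
use std::process::Command;

/// Run one module binary with `--json` and parse the ModuleResult
/// envelope from its stdout. The subcommand varies per module
/// (pull, deploy, ...). Sketch only: error handling is simplified.
fn run_module(
    binary: &str,
    subcommand: &str,
    manifest: &Path,
) -> Result<serde_json::Value, Box<dyn std::error::Error>> {
    let output = Command::new(binary)
        .arg("--manifest").arg(manifest)
        .arg("--json")
        .arg(subcommand)
        .output()?; // waits for exit; stdout carries the envelope, stderr the tracing

    let envelope: serde_json::Value = serde_json::from_slice(&output.stdout)?;
    Ok(envelope)
}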

Every module supports two invocation modes:

Mode            When                     Example
Flag mode       Interactive/local use    rdt-model-pull --target dev --entity waste-tracking pull
Manifest mode   Pipeline orchestration   rdt-model-pull --manifest /tmp/rdt-waste-tracking-abc/pull-manifest.json --json pull

Both modes produce identical behaviour. The manifest provides entity, target, workspace path, input paths, and output directory — everything the module needs in a single JSON file.
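One way to picture why the modes converge (a sketch with assumed type and field names, not the actual CLI code):

use std::path::PathBuf;

/// Everything a module needs, however it was invoked. Field names
/// mirror the manifest fields; the real struct lives in the shared code.
struct ModuleConfig {
    entity_id: String,
    target: String,
    output_dir: PathBuf,
}

/// Both invocation modes normalise to the same ModuleConfig, which is
/// why their downstream behaviour is identical.
enum Invocation {
    Flags { entity: String, target: String }, // interactive flag mode
    Manifest(PathBuf),                        // orchestrated manifest mode
}

fn resolve(inv: Invocation) -> Result<ModuleConfig, Box<dyn std::error::Error>> {
    match inv {
        // Flag mode: derive defaults for a local run.
        Invocation::Flags { entity, target } => Ok(ModuleConfig {
            entity_id: entity,
            target,
            output_dir: PathBuf::from("."),
        }),
        // Manifest mode: everything comes from one JSON file.
        Invocation::Manifest(path) => {
            let m: serde_json::Value =
                serde_json::from_str(&std::fs::read_to_string(path)?)?;
            Ok(ModuleConfig {
                entity_id: m["entity_id"].as_str().unwrap_or_default().to_string(),
                target: m["target"].as_str().unwrap_or_default().to_string(),
                output_dir: PathBuf::from(m["output_dir"].as_str().unwrap_or_default()),
            })
        }
    }
}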

Every module emits a ModuleResult JSON envelope when invoked with --json:

{
  "module": "rdt-model-store",
  "version": "0.1.0",
  "status": "ok",
  "entity_id": "waste-tracking",
  "timestamp": "2026-05-12T10:30:00Z",
  "duration_ms": 1200,
  "outputs": {
    "snowflake/ddl/waste-tracking.sql": "wrote",
    "dbt/models/bronze/waste-tracking.sql": "wrote",
    "dbt/models/bronze/waste-tracking.yml": "wrote"
  },
  "errors": []
}

Status values: ok, warning (all outputs skipped/unchanged), partial (some errors but some outputs), error (complete failure).
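Parsed into types, the envelope above might deserialize like this (a sketch inferred from the example; the authoritative definitions live in the shared CLI code):

use serde::Deserialize;
use std::collections::BTreeMap;

/// Mirrors the ModuleResult envelope shown above.
#[derive(Debug, Deserialize)]
struct ModuleResult {
    module: String,
    version: String,
    status: ModuleStatus,
    entity_id: String,
    timestamp: String,
    duration_ms: u64,
    /// Output path -> action ("wrote", "skipped", ...).
    outputs: BTreeMap<String, String>,
    errors: Vec<String>,
}

#[derive(Debug, Deserialize)]
#[serde(rename_all = "lowercase")]
enum ModuleStatus {
    Ok,      // all outputs written
    Warning, // completed, but every output was skipped/unchanged
    Partial, // some errors, but some outputs produced
    Error,   // complete failure
}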

When running through the orchestrator with --workspace, every pipeline run gets an isolated directory. This prevents parallel runs from interfering with each other.

/tmp/rdt-waste-tracking-a1b2c3d4/
├── pull/                    ← Phase 1 outputs
│   ├── model.json
│   └── pull-manifest.json
├── govern/                  ← Phase 2 outputs
│   └── governance.json
├── compile/                 ← Phase 3 outputs
│   └── artifacts/
│       ├── datacontract.yaml
│       ├── bronze.sql
│       ├── dbt_bronze.sql
│       └── ...
├── deploy/                  ← Phase 4 results
├── register/                ← Phase 5 results
├── support/                 ← Phase 6 results
├── pipeline-result.json     ← Aggregated outcome
└── pipeline-stderr.log      ← Tracing output

Each entity and each run gets a unique workspace path:

Entity A, run 1: /tmp/rdt-waste-tracking-{uuid-1}/
Entity A, run 2: /tmp/rdt-waste-tracking-{uuid-2}/
Entity B, run 1: /tmp/rdt-organization-site-{uuid-3}/

Zero contention. No file locks needed.

The workspace root is resolved from environment variables in this order:

  1. $RDT_WORKSPACE_DIR — explicit override
  2. $TMPDIR — system temp directory
  3. /tmp — fallback

Set RDT_WORKSPACE_DIR in CI if the runner’s temp directory has limited space.
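A sketch of that resolution plus the per-run path construction (function names are illustrative):

use std::path::PathBuf;

/// Resolve the workspace root: $RDT_WORKSPACE_DIR, then $TMPDIR, then /tmp.
fn workspace_root() -> PathBuf {
    std::env::var("RDT_WORKSPACE_DIR")
        .or_else(|_| std::env::var("TMPDIR"))
        .map(PathBuf::from)
        .unwrap_or_else(|_| PathBuf::from("/tmp"))
}

/// Each entity + run ID pair gets a unique directory under the root,
/// e.g. /tmp/rdt-waste-tracking-a1b2c3d4/, so parallel runs never collide.
fn run_workspace(entity: &str, run_id: &str) -> PathBuf {
    workspace_root().join(format!("rdt-{entity}-{run_id}"))
}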

After all phases complete successfully, the promote step copies artifacts from the workspace to their canonical repo paths. This is what keeps the repo untouched when a pipeline fails.

rdt-model-compile --target dev --entity waste-tracking promote --workspace /tmp/rdt-waste-tracking-abc

Promote maps workspace-relative paths to repo paths using the centralised paths.rs functions:

Workspace path                        Repo path
pull/model.json                       models/{entity}/model.json
govern/governance.json                models/{entity}/governance.json
compile/artifacts/datacontract.yaml   models/{entity}/datacontract.yaml
compile/artifacts/bronze.sql          snowflake/ddl/{entity}.sql
compile/artifacts/dbt_bronze.sql      dbt/models/bronze/{entity}.sql
compile/artifacts/openapi.yaml        apis/{entity}/openapi.yaml
compile/artifacts/mcp_tool.json       apis/{entity}/mcp_tool.json

If a file in the workspace doesn’t exist (module was optional or skipped), promote silently skips it.

Promote emits its own JSON summary of what was copied and what was skipped:

{
  "entity_id": "waste-tracking",
  "promoted": [
    {"source": "pull/model.json", "destination": "models/waste-tracking/model.json", "action": "wrote"},
    {"source": "compile/artifacts/bronze.sql", "destination": "snowflake/ddl/waste-tracking.sql", "action": "updated"}
  ],
  "skipped": ["infer/suggestions.json"]
}
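A sketch of the promote loop under those rules (this standalone version inlines a few of the mappings from the table and elides the wrote/updated action tracking; the real code uses the paths.rs helpers):

use std::fs;
use std::path::Path;

/// Copy workspace artifacts to canonical repo paths, silently skipping
/// anything the run did not produce. Mappings mirror the table above.
fn promote(workspace: &Path, repo: &Path, entity: &str) -> std::io::Result<Vec<String>> {
    let mappings = [
        ("pull/model.json", format!("models/{entity}/model.json")),
        ("govern/governance.json", format!("models/{entity}/governance.json")),
        ("compile/artifacts/bronze.sql", format!("snowflake/ddl/{entity}.sql")),
        // ... remaining rows of the table elided
    ];

    let mut promoted = Vec::new();
    for (src, dst) in mappings {
        let src_path = workspace.join(src);
        if !src_path.exists() {
            continue; // optional or skipped module output: skip silently
        }
        let dst_path = repo.join(&dst);
        if let Some(parent) = dst_path.parent() {
            fs::create_dir_all(parent)?; // ensure the repo directory exists
        }
        fs::copy(&src_path, &dst_path)?;
        promoted.push(dst);
    }
    Ok(promoted)
}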

The entity-pipeline.yml workflow automates multi-entity execution:

CI workflow — resolve matrix, build, parallel pipeline, commit, summary

The first job reads [entities].list from roche-data.toml and outputs a JSON matrix. You can override the entity list via workflow_dispatch with a comma-separated entities input.

A single build job then compiles all workspace binaries in release mode and uploads them as a GitHub Actions artifact. This runs once and is shared by all entity jobs.

Next, the pipeline stage runs one job per entity, in parallel. Each job:

  1. Downloads the CLI binaries
  2. Loads secrets from Vault (using JWT auth)
  3. Runs scripts/pipeline.sh <entity> --target <env>
  4. Uploads pipeline-result-<entity>.json as an artifact
  5. On failure, uploads the workspace directory for debugging

Key settings:

  • fail-fast: false — one entity’s failure does not cancel others
  • max-parallel: 5 — limits concurrent runner usage
  • timeout-minutes: 30 — prevents hung jobs

The commit-artifacts job runs only if all pipeline jobs succeed. It commits the promoted artifacts with a conventional commit message and pushes.

Finally, a summary job generates a GitHub Step Summary table showing each entity's status and duration.

Manual trigger examples:
# Run all entities from roche-data.toml against dev
gh workflow run entity-pipeline.yml -f target=dev
# Run specific entities against test
gh workflow run entity-pipeline.yml -f target=test -f entities="waste-tracking"
# Run against prod (requires approval)
gh workflow run entity-pipeline.yml -f target=prod

Environment promotion flow — DEV to TEST to PROD with approval gates

Each environment has its own GitHub Environment with protection rules. Vault secrets are scoped per environment (secret/dev/..., secret/test/..., secret/prod/...).

scripts/pipeline.sh is the entry point for each entity pipeline run. It handles workspace lifecycle, orchestration, and promotion in a single script.

./scripts/pipeline.sh <entity> [--target dev|test|prod] [--keep-workspace]

What it does:

  1. Generates a run ID (UUIDv7)
  2. Creates workspace with phase subdirectories
  3. Calls rdt-model-compile run --workspace <ws> --json
  4. Checks the pipeline status from pipeline-result.json
  5. On success (ok or warning): runs rdt-model-compile promote
  6. Copies pipeline-result.json to repo root for CI artifact upload
  7. Cleans up workspace (unless --keep-workspace)

Use --keep-workspace to preserve the workspace for debugging:

./scripts/pipeline.sh waste-tracking --target dev --keep-workspace
ls /tmp/rdt-waste-tracking-*/

How the orchestrator reacts to module failures:

Scenario                                   Pipeline status   Behaviour
All required modules in a phase succeed    ok                Continue to next phase
Some required modules fail                 partial           Continue, but pipeline marked partial
All required modules in a phase fail       error             Stop immediately; no further phases run
Optional module fails (profile, infer)     warning           Continue, noted in result
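Reusing the ModuleStatus enum sketched earlier, the per-phase decision reduces to roughly this (names are illustrative):

/// Decide what happens after a phase, given the statuses of its
/// *required* modules. Optional-module failures never stop the run;
/// they only downgrade it (the warning row above).
enum PhaseDecision {
    Continue,          // all required modules succeeded
    ContinueAsPartial, // some failed: keep going, mark the run partial
    Stop,              // every required module failed: abort remaining phases
}

fn phase_decision(required: &[ModuleStatus]) -> PhaseDecision {
    let failed = required
        .iter()
        .filter(|s| matches!(**s, ModuleStatus::Error))
        .count();
    if failed == 0 {
        PhaseDecision::Continue
    } else if failed == required.len() {
        PhaseDecision::Stop
    } else {
        PhaseDecision::ContinueAsPartial
    }
}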

In multi-entity CI:

  • Each entity runs independently (fail-fast: false)
  • A failed entity’s workspace is uploaded as a debug artifact
  • Other entities continue unaffected
  • The commit-artifacts job only runs if ALL entities succeed

To debug a failed run:

  1. Download the workspace artifact from the GitHub Actions run
  2. Inspect pipeline-result.json for per-module status
  3. Check pipeline-stderr.log for tracing output
  4. Look at individual module result files in phase subdirectories
  5. Re-run locally with --keep-workspace:
./scripts/pipeline.sh waste-tracking --target dev --keep-workspace
cat /tmp/rdt-waste-tracking-*/pipeline-result.json | jq '.phases[].modules[] | select(.status == "error")'

Once the cause is fixed:
# Retry just the failed entity
gh workflow run entity-pipeline.yml -f target=dev -f entities="waste-tracking"

The pipeline manifest JSON conforms to cli/common/schemas/modules/pipeline-manifest.schema.json.

Field        Type      Required   Description
entity_id    string    yes        Entity identifier (alphanumeric, hyphens, underscores)
workspace    string    yes        Absolute path to workspace root
target       enum      yes        dev, test, or prod
run_id       string    yes        UUIDv7 correlation ID
dry_run      boolean   no         Skip side-effects (default: false)
inputs       object    no         Named input paths relative to workspace
output_dir   string    yes        Phase subdirectory for outputs

Example:

{
  "entity_id": "waste-tracking",
  "workspace": "/tmp/rdt-waste-tracking-abc123",
  "target": "dev",
  "run_id": "abc123",
  "dry_run": false,
  "inputs": {
    "model": "pull/model.json",
    "governance": "govern/governance.json"
  },
  "output_dir": "deploy"
}
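Inferred from the field table, the manifest could deserialize into a struct like this (a sketch; the JSON Schema above is authoritative):

use serde::Deserialize;
use std::collections::HashMap;
use std::path::PathBuf;

#[derive(Debug, Deserialize)]
struct PipelineManifest {
    entity_id: String,
    workspace: PathBuf,
    target: Target,
    run_id: String,
    #[serde(default)] // absent means false, per the schema default
    dry_run: bool,
    /// Named input paths, relative to `workspace`.
    #[serde(default)]
    inputs: HashMap<String, PathBuf>,
    output_dir: PathBuf,
}

#[derive(Debug, Deserialize)]
#[serde(rename_all = "lowercase")]
enum Target {
    Dev,
    Test,
    Prod,
}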
Concurrency safety guarantees:

Scenario                                  Protection
Two entities in parallel CI jobs          Separate runners + separate workspaces. Entity-scoped paths are non-overlapping.
Same entity re-run (retry)                Different UUIDv7 = different workspace. No collision.
Shared files (e.g. rules/registry.yaml)   Promote runs sequentially in commit-artifacts. Merge if needed.
Failed pipeline                           Workspace is never promoted. Repo remains clean.