Architecture
Purpose
This document explains how the Open Data Platform is structured, how data flows through it, and where each major responsibility lives.
Problem and Scope
The platform solves a common analytics delivery problem: ingesting external and internal data, transforming it into trusted datasets, publishing metadata, and serving decision-ready outputs.
In-scope:
- Batch ingestion and transformation
- Metadata and lineage publication
- BI serving and operator workflows
- Platform observability and QA controls
Out-of-scope:
- Real-time event streaming as a first-class pipeline pattern
- Multi-tenant isolation model
- Production-grade HA guarantees for all services
System Context
flowchart LR
Users["Data Engineers / Analysts / Platform Ops"] --> Portal["Frontend Portal"]
Users --> AirflowUI["Airflow UI"]
Users --> SupersetUI["Superset UI"]
Users --> DataHubUI["DataHub UI"]
External["External Data APIs and Feeds\nCBS, Adzuna, UWV, RSS, Sitemap"] --> Pipelines["Pipelines and Connectors"]
Pipelines --> Lakehouse["Lakehouse Storage\nMinIO Bronze/Silver/Gold"]
Pipelines --> Warehouse["Postgres Warehouse"]
Pipelines --> Metadata["DataHub Metadata Platform"]
Warehouse --> SupersetUI
Metadata --> DataHubUI
Pipelines --> O11y["OpenTelemetry + Prometheus/Loki/Tempo"]
AirflowUI --> O11y
Warehouse --> O11y
Component View
Operator Plane
- Frontend (frontend/): React launchpad linking all platform surfaces
- Airflow UI: DAG operations and task-level monitoring
- DataHub UI: metadata catalog and lineage exploration
- Superset UI: BI dashboards and ad hoc query surface
- Grafana: observability dashboards
Control Plane
- DAGs (dags/): orchestration definitions
- Airflow scheduler/webserver/init jobs
- QA suites in tests/ for quality, contracts, governance, and E2E validation
Data Plane
- MinIO: bronze, silver, gold buckets for medallion layering
- Postgres warehouse: serving tables and dbt outputs
- DataHub: GMS + Kafka + Elasticsearch + MySQL for metadata operations
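The medallion layering named in the MinIO bullet can be illustrated with a small helper. This is a minimal sketch: the bucket names come from the list above, but the object key layout (domain/dataset/run_date=...) and the function names `next_layer` and `object_uri` are assumptions made for illustration, not the platform's actual convention.

```python
from datetime import date

# Bucket names come from the Data Plane list; promotion runs left to right.
LAYERS = ("bronze", "silver", "gold")


def next_layer(layer):
    """Return the layer that data is promoted into, or None once it reaches gold."""
    index = LAYERS.index(layer)  # raises ValueError for an unknown layer
    return LAYERS[index + 1] if index + 1 < len(LAYERS) else None


def object_uri(layer, domain, dataset, run_date):
    """Build an s3:// URI for one run's partition (hypothetical key layout)."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer}")
    return f"s3://{layer}/{domain}/{dataset}/run_date={run_date.isoformat()}/"


if __name__ == "__main__":
    # "cbs" is used here only as an example dataset name.
    print(object_uri("bronze", "odp_staffing_demand", "cbs", date.today()))
    print(next_layer("bronze"))
```

Keeping the layer order in one tuple means a promotion step never has to hard-code which bucket comes next.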
Supporting Plane
- Observability stack in ops/observability/
- Keycloak realm and SSO config in ops/keycloak/
- Portal API telemetry endpoints
Runtime Data Flow
The primary domain flow implemented today is odp_staffing_demand.
flowchart TD
A["CBS / Adzuna / UWV"] --> B["Bronze tables"]
B --> C["Silver tables"]
C --> D["Gold tables"]
D --> E["Postgres warehouse schema: odp_staffing_demand"]
E --> F["Superset dashboards"]
D --> G["dbt models"]
G --> E
C --> H["Data quality checks"]
D --> H
E --> H
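The flowchart above is a dependency graph, so a valid execution order can be derived from it mechanically. The sketch below encodes the same edges with Python's standard `graphlib` module. The stage names are hypothetical labels for the flowchart nodes, and treating the data quality checks as a single stage that waits on silver, gold, and the warehouse is an assumption; the real DAGs may run checks per layer.

```python
from graphlib import TopologicalSorter

# Each key lists the stages it depends on, mirroring the flowchart edges.
# Stage names are illustrative labels, not identifiers from the repository.
DEPENDS_ON = {
    "bronze": {"sources"},
    "silver": {"bronze"},
    "gold": {"silver"},
    "dbt_models": {"gold"},
    "warehouse": {"gold", "dbt_models"},
    "superset": {"warehouse"},
    "quality_checks": {"silver", "gold", "warehouse"},
}


def run_order(depends_on):
    """Return one valid execution order that respects every dependency."""
    return list(TopologicalSorter(depends_on).static_order())


def upstream(stage, depends_on):
    """Return every stage that must complete before `stage` can run."""
    seen = set()
    stack = [stage]
    while stack:
        for parent in depends_on.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


if __name__ == "__main__":
    print(run_order(DEPENDS_ON))
    print(sorted(upstream("superset", DEPENDS_ON)))
```

`upstream` is useful for impact analysis: it answers which stages a dashboard ultimately depends on when an upstream feed fails.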
Metadata and Governance Flow
flowchart LR
Schema["schema/*.yaml + warehouse.dbml"] --> Validators["Validation scripts"]
Validators --> CI["GitHub Actions"]
Warehouse["Postgres warehouse"] --> CatalogScript["register_datahub_catalog.py"]
Schema --> SyncScript["sync_dbml_to_datahub.py"]
CatalogScript --> DataHub["DataHub GMS"]
SyncScript --> DataHub
DataHub --> Discover["Search, lineage, ownership"]
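To make the schema-as-code validation step concrete, here is a minimal sketch of the kind of check a validation script might perform: comparing a declared table definition against the columns actually present in the warehouse. The function name `validate_table`, the table columns, and the type strings are all hypothetical. The real scripts and the structure of schema/*.yaml are not described in this document, so plain dictionaries stand in for the parsed files.

```python
def validate_table(declared, actual):
    """Compare a declared table schema with the columns found in the warehouse.

    Both arguments map column name -> type string. Returns a list of
    human-readable problems; an empty list means the table passes.
    """
    problems = []
    for column, expected in declared.items():
        if column not in actual:
            problems.append(f"missing column: {column}")
        elif actual[column] != expected:
            problems.append(
                f"type mismatch on {column}: expected {expected}, found {actual[column]}"
            )
    for column in actual:
        if column not in declared:
            problems.append(f"undeclared column: {column}")
    return problems


if __name__ == "__main__":
    # Hypothetical declaration, standing in for a parsed schema file.
    declared = {"region_code": "text", "vacancy_count": "integer"}
    # Hypothetical warehouse state, standing in for an information_schema query.
    actual = {"region_code": "text", "vacancy_count": "bigint", "loaded_at": "timestamp"}
    for problem in validate_table(declared, actual):
        print(problem)
```

Returning a list of problems instead of raising on the first one lets a CI job report every drift in a single run.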
Deployment Model
- Local development: Docker Compose (docker-compose.yml)
- Local Kubernetes: kind cluster via scripts/k8s/k8s_dev_up.sh
- Cloud: AKS provisioning/deploy via scripts/aks/aks_up.sh with Key Vault-backed secret sync
- Scaleway: Kapsule deployment via scripts/aks/scaleway_redeploy_all.sh
More detail: Deployment Guide
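The deployment targets above can be summarized as a lookup from target to entry point. The sketch below only builds the command and does not execute anything. The target keys (`compose`, `kind`, `aks`, `scaleway`) are hypothetical labels, the scripts are assumed to run without arguments, and the Docker Compose invocation is one common form that may not match what the project actually uses.

```python
import shlex

# Entry points as listed in the Deployment Model; the keys are illustrative labels.
DEPLOY_COMMANDS = {
    "compose": ["docker", "compose", "-f", "docker-compose.yml", "up", "-d"],
    "kind": ["bash", "scripts/k8s/k8s_dev_up.sh"],
    "aks": ["bash", "scripts/aks/aks_up.sh"],
    "scaleway": ["bash", "scripts/aks/scaleway_redeploy_all.sh"],
}


def deploy_command(target):
    """Return the argument list for a deployment target, without running it."""
    if target not in DEPLOY_COMMANDS:
        known = ", ".join(sorted(DEPLOY_COMMANDS))
        raise ValueError(f"unknown target {target!r}; expected one of: {known}")
    return list(DEPLOY_COMMANDS[target])  # copy so callers cannot mutate the table


if __name__ == "__main__":
    for name in DEPLOY_COMMANDS:
        print(f"{name}: {shlex.join(deploy_command(name))}")
```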
Architecture Decisions
- Keep both Spark-compatible and Postgres/dbt-native transformation paths
- Use schema-as-code plus QA policy checks as governance baseline
- Favor composable OSS services instead of tightly coupled platform products