Architecture

Purpose

This document explains how the Open Data Platform is structured, how data flows through it, and where each major responsibility lives.

Problem and Scope

The platform solves a common analytics delivery problem: ingesting external and internal data, transforming it into trusted datasets, publishing metadata, and serving decision-ready outputs.

In-scope:

  • Batch ingestion and transformation
  • Metadata and lineage publication
  • BI serving and operator workflows
  • Platform observability and QA controls

Out-of-scope:

  • Real-time event streaming as a first-class pipeline pattern
  • Multi-tenant isolation model
  • Production-grade HA guarantees for all services

System Context

flowchart LR
  Users["Data Engineers / Analysts / Platform Ops"] --> Portal["Frontend Portal"]
  Users --> AirflowUI["Airflow UI"]
  Users --> SupersetUI["Superset UI"]
  Users --> DataHubUI["DataHub UI"]

  External["External Data APIs and Feeds\nCBS, Adzuna, UWV, RSS, Sitemap"] --> Pipelines["Pipelines and Connectors"]

  Pipelines --> Lakehouse["Lakehouse Storage\nMinIO Bronze/Silver/Gold"]
  Pipelines --> Warehouse["Postgres Warehouse"]
  Pipelines --> Metadata["DataHub Metadata Platform"]

  Warehouse --> SupersetUI
  Metadata --> DataHubUI

  Pipelines --> O11y["OpenTelemetry + Prometheus/Loki/Tempo"]
  AirflowUI --> O11y
  Warehouse --> O11y
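
The External-to-Pipelines edge above is a plain batch pull. As a rough illustration of that boundary, the sketch below fetches one page from a feed with requests; the URL, parameters, and fetch_feed helper are hypothetical, not the platform's actual connector code.

import requests

def fetch_feed(url: str, params: dict | None = None) -> list[dict]:
    """Pull one batch of raw records from an external API."""
    response = requests.get(url, params=params, timeout=30)
    response.raise_for_status()  # surface HTTP failures to the orchestrator
    return response.json()

# Hypothetical feed; real connectors handle CBS, Adzuna, UWV, RSS, and sitemaps.
records = fetch_feed("https://example.org/api/vacancies", params={"page": 1})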

Component View

Operator Plane

  • Frontend (frontend/): React launchpad linking all platform surfaces
  • Airflow UI: DAG operations and task-level monitoring
  • DataHub UI: metadata catalog and lineage exploration
  • Superset UI: BI dashboards and ad hoc query surface
  • Grafana: observability dashboards

Control Plane

  • DAGs (dags/): orchestration definitions (see the DAG sketch after this list)
  • Airflow scheduler/webserver/init jobs
  • QA suites in tests/ for quality, contracts, governance, and E2E validation
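
To make the Control Plane concrete, here is a minimal sketch of what a DAG under dags/ could look like, assuming Airflow 2.x; the task names and callable are placeholders, and only the odp_staffing_demand DAG id comes from this document.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def _transform(layer: str) -> None:
    print(f"transforming {layer}")  # stand-in for the real pipeline step

with DAG(
    dag_id="odp_staffing_demand",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # Airflow 2.4+; earlier versions use schedule_interval
    catchup=False,
) as dag:
    bronze = PythonOperator(task_id="land_bronze", python_callable=_transform, op_args=["bronze"])
    silver = PythonOperator(task_id="refine_silver", python_callable=_transform, op_args=["silver"])
    gold = PythonOperator(task_id="publish_gold", python_callable=_transform, op_args=["gold"])
    bronze >> silver >> gold  # mirrors the medallion flow in the Data Plane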

Data Plane

  • MinIO: bronze, silver, gold buckets for medallion layering (see the write sketch after this list)
  • Postgres warehouse: serving tables and dbt outputs
  • DataHub: GMS + Kafka + Elasticsearch + MySQL for metadata operations
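
Because MinIO exposes the S3 API, the medallion buckets can be written with any S3 client. A minimal sketch with boto3, assuming a local endpoint and default credentials (all placeholders):

import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",  # MinIO endpoint, placeholder
    aws_access_key_id="minioadmin",        # placeholder credentials
    aws_secret_access_key="minioadmin",
)

# Land one raw payload in the bronze layer; the key layout is illustrative.
s3.put_object(
    Bucket="bronze",
    Key="adzuna/2024-01-01/vacancies.json",
    Body=b'{"records": []}',
)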

Supporting Plane

  • Observability stack in ops/observability/ (instrumentation sketch after this list)
  • Keycloak realm and SSO config in ops/keycloak/
  • Portal API telemetry endpoints
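
A sketch of how a pipeline step could emit traces into this stack with the OpenTelemetry Python SDK; the console exporter stands in for whatever OTLP exporter the collector in ops/observability/ actually expects:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("odp.pipelines")  # instrumentation name is illustrative
with tracer.start_as_current_span("land_bronze"):
    pass  # pipeline work here; the span captures timing and status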

Runtime Data Flow

The primary domain flow implemented today is odp_staffing_demand.

flowchart TD
  A["CBS / Adzuna / UWV"] --> B["Bronze tables"]
  B --> C["Silver tables"]
  C --> D["Gold tables"]
  D --> E["Postgres warehouse schema: odp_staffing_demand"]
  E --> F["Superset dashboards"]

  D --> G["dbt models"]
  G --> E

  C --> H["Data quality checks"]
  D --> H
  E --> H
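
The data quality checks attached to the silver, gold, and warehouse layers can be as simple as post-load assertions. A minimal sketch against the serving schema, where the DSN, table, column, and threshold are assumptions (the platform's real checks live in tests/):

import psycopg2

CHECK_SQL = """
    SELECT count(*) FILTER (WHERE region IS NULL) AS null_regions,
           count(*) AS total
    FROM odp_staffing_demand.gold_vacancies  -- hypothetical gold table
"""

with psycopg2.connect("postgresql://odp@warehouse:5432/odp") as conn:
    with conn.cursor() as cur:
        cur.execute(CHECK_SQL)
        null_regions, total = cur.fetchone()

assert total > 0, "gold table is empty"
assert null_regions / total < 0.01, "too many NULL regions in the gold layer"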

Metadata and Governance Flow

flowchart LR
  Schema["schema/*.yaml + warehouse.dbml"] --> Validators["Validation scripts"]
  Validators --> CI["GitHub Actions"]

  Warehouse["Postgres warehouse"] --> CatalogScript["register_datahub_catalog.py"]
  Schema --> SyncScript["sync_dbml_to_datahub.py"]

  CatalogScript --> DataHub["DataHub GMS"]
  SyncScript --> DataHub

  DataHub --> Discover["Search, lineage, ownership"]
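
Both registration scripts ultimately push metadata change proposals into GMS. A rough sketch of that emit step using the acryl-datahub Python SDK, with the GMS URL, dataset name, and description as placeholders:

from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

emitter = DatahubRestEmitter("http://localhost:8080")  # GMS endpoint, placeholder

urn = make_dataset_urn(platform="postgres", name="odp.odp_staffing_demand.gold_vacancies")
emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn=urn,
        aspect=DatasetPropertiesClass(description="Gold-layer staffing demand table"),
    )
)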

Deployment Model

  • Local development: Docker Compose (docker-compose.yml)
  • Local Kubernetes: kind cluster via scripts/k8s/k8s_dev_up.sh
  • Cloud: AKS provisioning/deploy via scripts/aks/aks_up.sh with Key Vault-backed secret sync
  • Scaleway: Kapsule deployment via scripts/aks/scaleway_redeploy_all.sh

More detail: Deployment Guide

Architecture Decisions

  • Keep both Spark-compatible and Postgres/dbt-native transformation paths
  • Use schema-as-code plus QA policy checks as the governance baseline (validator sketch below)
  • Favor composable OSS services instead of tightly coupled platform products
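
To give a sense of what the schema-as-code check can amount to, the sketch below validates schema/*.yaml files for a set of required keys; the key list and failure handling are assumptions, with the real logic living in the validation scripts wired into GitHub Actions.

import sys
from pathlib import Path

import yaml

REQUIRED_KEYS = {"name", "owner", "columns"}  # assumed contract fields

failures = []
for path in Path("schema").glob("*.yaml"):
    doc = yaml.safe_load(path.read_text()) or {}
    missing = REQUIRED_KEYS - set(doc)
    if missing:
        failures.append(f"{path}: missing {sorted(missing)}")

if failures:
    sys.exit("\n".join(failures))  # non-zero exit fails the CI job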