Databricks Control Plane and Data Plane
🚀 What is Databricks Lakehouse Architecture?
Traditionally, companies had two separate systems:
- Data Lake 🪣 (cheap storage, e.g., Azure Data Lake, S3, Blob): stores raw structured, semi-structured, and unstructured data → flexible, but lacks strong data management (ACID transactions, governance, BI support).
- Data Warehouse 🏢 (expensive but fast): optimized for SQL queries, BI, and analytics → strong schema enforcement and governance, but less flexible and costly.
🔹 The Lakehouse combines both in one system:
- The low-cost, flexible storage of a data lake
- The governance, ACID transactions, and performance of a warehouse
🏗️ Core Components of Databricks Lakehouse
- Storage Layer (Data Lake foundation)
  - Data stored in open formats like Parquet, ORC, Avro, and Delta.
  - Uses cloud object storage (e.g., Azure Data Lake Storage Gen2, AWS S3, GCS).
- Delta Lake (the secret sauce 🧂)
  - Adds ACID transactions on top of data lake storage.
  - Provides schema enforcement, schema evolution, time travel, and data versioning (a short sketch follows this list).
  - Solves problems like “eventual consistency” and corrupted files in raw data lakes.
- Unified Governance (Unity Catalog)
  - Centralized metadata & permissions for files, tables, ML models, and dashboards.
  - Manages security, lineage, and data discovery across the Lakehouse.
- Compute Layer (Databricks Runtime / Spark + Photon)
  - Uses the Apache Spark + Photon execution engine for batch, streaming, ML, and BI.
  - Same engine for ETL, streaming, AI, and SQL queries → no silos.
- Data Management Features
  - Streaming + Batch = One Pipeline (via Delta Live Tables).
  - Materialized Views, Incremental Processing, Change Data Capture (CDC).
  - MLflow integration for machine learning lifecycle management.
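To see what Delta Lake adds in practice, here is a minimal sketch of an ACID write, a schema-enforced append, and time travel. It assumes a Databricks notebook (where a Delta-enabled `spark` session already exists); the table name `demo_orders` is hypothetical.

```python
from pyspark.sql import Row

# Transactional write: the whole batch commits atomically or not at all.
spark.createDataFrame([Row(id=1, amount=10.0)]) \
    .write.format("delta").mode("overwrite").saveAsTable("demo_orders")

# Appends are validated against the table schema (schema enforcement):
# a DataFrame with mismatched columns is rejected, not silently written.
spark.createDataFrame([Row(id=2, amount=25.5)]) \
    .write.format("delta").mode("append").saveAsTable("demo_orders")

# Time travel: query the table as it looked at an earlier version.
spark.sql("SELECT * FROM demo_orders VERSION AS OF 0").show()
```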
📊 Architecture Diagram (Conceptual Flow)
```
┌────────────────────────────────────────┐
│           Business Apps / BI           │
│          (Power BI, Tableau)           │
└───────────────────▲────────────────────┘
                    │
┌───────────────────┴────────────────────┐
│             Databricks SQL             │
│            & Photon Engine             │
└───────────────────▲────────────────────┘
                    │
┌───────────────────┴────────────────────┐
│     Delta Lake (ACID, Schema, CDC)     │
│ (Open Storage Format on Parquet + Log) │
└───────────────────▲────────────────────┘
                    │
┌───────────────────┴────────────────────┐
│           Cloud Object Store           │
│            (ADLS, S3, GCS)             │
└────────────────────────────────────────┘
```
⚡ Benefits of Lakehouse
- ✅ One platform → no need for separate warehouse + lake.
- ✅ Cost efficient → cheap storage, scalable compute.
- ✅ Flexibility → structured + semi-structured + unstructured.
- ✅ ACID reliability → transactions, schema enforcement.
- ✅ End-to-end → supports ETL, real-time streaming, ML/AI, BI in the same system.
🌐 In Databricks Azure Context
- Storage → Azure Data Lake Storage (ADLS Gen2)
- Security/Governance → Azure Key Vault + Unity Catalog
- Compute → Databricks Clusters with Photon
- Serving → Power BI (Direct Lake Mode)
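As a hedged sketch of how these Azure pieces fit together in a notebook (the secret scope, secret name, storage account, and paths below are all hypothetical, and the Key Vault-backed scope must already exist):

```python
# Fetch the ADLS access key from an Azure Key Vault-backed secret scope.
adls_key = dbutils.secrets.get(scope="kv-demo-scope", key="adls-account-key")

# Point Spark at the storage account using the retrieved key.
spark.conf.set(
    "fs.azure.account.key.mystorageacct.dfs.core.windows.net",  # hypothetical account
    adls_key,
)

# Read a Delta table directly from ADLS Gen2.
df = spark.read.format("delta").load(
    "abfss://lake@mystorageacct.dfs.core.windows.net/silver/events"
)
df.show()
```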
Control Plane vs Data Plane
🚦 1. Simple Analogy
Think of Databricks like Uber:
- Control Plane = Uber App 📱 → handles where you go, who drives, billing, monitoring.
- Data Plane = The Car 🚗 → where the actual ride happens (your data processing).
So, Databricks separates management functions (control) from execution functions (data).
🏗️ 2. Databricks Architecture
🔹 Control Plane
- Managed by Databricks itself (runs in Databricks’ own AWS/Azure/GCP accounts).
- Contains:
  - Web UI / REST API → where you log in, create clusters, and manage jobs.
  - Cluster Manager → decides how to spin up VMs/compute.
  - Job Scheduler → triggers pipelines, notebooks, workflows.
  - Metadata Storage → notebooks, workspace configs, Unity Catalog metadata.
  - Monitoring / Logging → cluster health, job logs, error reporting.
⚠️ Important: Your raw data does not go here. This plane is about orchestration, configs, and metadata.
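Since every control plane interaction goes through the REST API, here is a minimal sketch of listing a workspace's clusters with the Clusters API (the workspace URL and token are placeholders):

```python
import requests

WORKSPACE_URL = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder
TOKEN = "<personal-access-token>"  # placeholder; never hard-code real tokens

# Ask the control plane which clusters exist and what state they are in.
resp = requests.get(
    f"{WORKSPACE_URL}/api/2.0/clusters/list",
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=30,
)
resp.raise_for_status()
for cluster in resp.json().get("clusters", []):
    print(cluster["cluster_id"], cluster["state"])
```

Note that this call returns only metadata (IDs, states, configs), never the data those clusters process.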
🔹 Data Plane
- Runs inside your cloud account (your subscription/project).
- Contains:
  - Clusters/Compute (Spark Executors, Driver, Photon) → where the data is processed.
  - Your Data → stored in ADLS, S3, or GCS.
  - Networking → VNETs, Private Endpoints, Peering.
  - Libraries / Runtime → Spark, Delta Lake, MLflow, etc.
⚠️ Key Point: The actual data never leaves your cloud account. Processing happens within your boundary.
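For contrast, a sketch of work that lives entirely in the data plane: reading, aggregating, and writing all happen on the driver and executors inside your subscription, against your own storage (the paths are hypothetical):

```python
# Read raw sales data from your own ADLS Gen2 account.
sales = spark.read.format("delta").load(
    "abfss://lake@mystorageacct.dfs.core.windows.net/bronze/sales"
)

# The shuffle and aggregation run on your executors, inside your VNET.
daily_totals = sales.groupBy("order_date").sum("amount")

# Results land back in your storage account; nothing crosses to Databricks'.
daily_totals.write.format("delta").mode("overwrite").save(
    "abfss://lake@mystorageacct.dfs.core.windows.net/silver/daily_sales"
)
```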
🔐 3. Security Perspective
- Control Plane:
  - Managed by Databricks.
  - Contains metadata, credentials, and configs, but not raw data.
  - Can be hardened with SCIM, SSO, RBAC, and IP Access Lists (see the sketch after this list).
- Data Plane:
  - Fully inside your cloud subscription.
  - Your sensitive data (PII, transactions, crypto, etc.) never touches Databricks’ account.
  - You control networking:
    - Private Link / VNET Injection → ensures traffic never goes over the public internet.
    - Key Vault / KMS for secrets.
    - Storage firewalls.
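To make the IP Access List hardening concrete, here is a hedged sketch against the IP Access Lists REST API (the label and CIDR range are made-up examples, and the feature must already be enabled on the workspace):

```python
import requests

WORKSPACE_URL = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder
TOKEN = "<personal-access-token>"  # placeholder

# Allow workspace (control plane) access only from a corporate CIDR range.
resp = requests.post(
    f"{WORKSPACE_URL}/api/2.0/ip-access-lists",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "label": "corp-office",              # hypothetical label
        "list_type": "ALLOW",
        "ip_addresses": ["203.0.113.0/24"],  # documentation-range CIDR
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```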
🖼️ 4. Architecture Diagram
```
┌────────────────────────────────────────┐
│             CONTROL PLANE              │
│      (Databricks-managed account)      │
│                                        │
│  - Web UI / API                        │
│  - Cluster Manager                     │
│  - Job Scheduler                       │
│  - Unity Catalog Metadata              │
│  - Logs / Monitoring                   │
└───────────────────▲────────────────────┘
                    │
                    │  Secure REST/API Calls
                    │
┌───────────────────┴────────────────────┐
│               DATA PLANE               │
│   (Your cloud subscription/project)    │
│                                        │
│  - Spark Driver & Executors            │
│  - Photon Engine                       │
│  - Data in ADLS/S3/GCS                 │
│  - Networking (VNET, Firewall, PEs)    │
│  - Secrets from Key Vault/KMS          │
└────────────────────────────────────────┘
```
⚡ 5. Why This Separation?
- ✅ Security → your data never leaves your account.
- ✅ Scalability → Databricks manages orchestration; you manage compute.
- ✅ Multi-cloud → the same control plane works across AWS, Azure, and GCP.
- ✅ Compliance → helps with HIPAA, GDPR, and financial regulations.
🔑 6. Special Feature: Databricks Serverless SQL
- Here, the data plane compute is also managed by Databricks (it runs in Databricks' account, not yours).
- Good for quick BI queries (e.g., from Power BI), though some enterprises avoid it for sensitive data.
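A minimal sketch of querying a serverless SQL warehouse from outside the workspace using the `databricks-sql-connector` package (hostname, HTTP path, and token are placeholders):

```python
from databricks import sql  # pip install databricks-sql-connector

with sql.connect(
    server_hostname="adb-1234567890123456.7.azuredatabricks.net",  # placeholder
    http_path="/sql/1.0/warehouses/abc123def456",                  # placeholder
    access_token="<personal-access-token>",                        # placeholder
) as connection:
    with connection.cursor() as cursor:
        cursor.execute("SELECT current_catalog(), current_date()")
        print(cursor.fetchall())
```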