# Lakeflow Declarative Pipelines Introduction

Here is an introduction to Lakeflow Declarative Pipelines in Databricks:
## 1. What is Lakeflow?
Lakeflow Declarative Pipelines (the evolution of Delta Live Tables, or DLT) is a declarative pipeline framework from Databricks designed to simplify building and managing data pipelines over Delta Lake. In a traditional ETL pipeline you imperatively write every transformation step; with Lakeflow you declare the desired end state of your data and let Databricks handle execution, orchestration, and dependency management.

Think of it as "telling Databricks what you want, not how to do it".
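As a minimal sketch of that idea (using the `dlt` Python API covered later in this article; the table and column names here are illustrative), you declare a table as a function that returns a DataFrame, and Databricks decides when and how to materialize it:

```python
import dlt  # available inside a Databricks pipeline

# Declared, not orchestrated: the framework schedules this function,
# tracks its dependencies, and retries it on failure.
@dlt.table(comment="Deduplicated orders")
def clean_orders():
    # `spark` is provided by the pipeline runtime
    return spark.read.table("raw_orders").dropDuplicates(["order_id"])
```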
## 2. Key Concepts

| Concept | Description |
|---|---|
| Pipeline | A logical collection of transformations (like ETL jobs) defined declaratively. |
| Lakeflow Table | A Delta table managed by Lakeflow, which tracks lineage, schema, and dependencies. |
| Declarative Config | A JSON/YAML-like specification describing sources, transformations, and targets. |
| State Management | Lakeflow tracks which data has been processed and ensures idempotent updates. |
| Incremental Processing | Automatically detects new/changed data and applies transformations incrementally. |
## 3. How It Differs from Normal Pipelines

| Feature | Traditional ETL | Lakeflow Declarative Pipeline |
|---|---|---|
| Definition | Imperative: step-by-step code | Declarative: specify inputs, outputs, transformations |
| Orchestration | Manual, or an external scheduler such as Airflow | Built-in, dependency-aware orchestration |
| Data Lineage | Requires extra tooling | Automatic tracking of lineage between tables |
| Error Handling | Manual retries | Automatic state management & retries |
| Incremental Loads | Developer writes the logic | Lakeflow detects changes & processes incrementally |
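To make the incremental-loads row concrete, compare hand-wired Structured Streaming with the declarative version. A sketch, assuming a Databricks notebook where `spark` is defined; the paths and table names are illustrative:

```python
# Imperative: you wire up the checkpoint and target table yourself.
(spark.readStream.table("raw_sales")
    .writeStream
    .option("checkpointLocation", "/chk/sales_bronze")  # you own this path
    .toTable("sales_bronze"))

# Declarative: checkpoints, state, and retries belong to the framework.
import dlt

@dlt.table
def sales_bronze():
    return spark.readStream.table("raw_sales")
```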
## 4. Basic Pipeline Flow
- Define sources: Raw Delta tables, cloud storage, or external systems.
- Declare transformations: For example, aggregations, joins, or enrichments.
- Specify targets: Delta tables managed by Lakeflow.
- Run pipeline: Databricks ensures only necessary transformations run, handles dependencies, and maintains consistency.
## 5. Advantages
- Less boilerplate code → focus on business logic, not orchestration.
- Automatic incremental updates → avoids reprocessing all data.
- Built-in lineage and auditing → helps with compliance and debugging.
- Easier pipeline management → declarative config files version-controlled like code.
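On the last point, the pipeline definition itself is a small settings file that can live in git. A simplified sketch (the field names follow the Delta Live Tables pipelines API, but treat the exact schema as an assumption and verify against the Databricks docs):

```json
{
  "name": "sales_pipeline",
  "target": "sales_db",
  "continuous": false,
  "libraries": [
    { "notebook": { "path": "/Repos/team/pipelines/sales_pipeline" } }
  ]
}
```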
## 6. Example (Conceptual)

```yaml
# Sample declarative pipeline config (conceptual)
pipeline_name: sales_pipeline
tables:
  - name: raw_sales
    source: s3://data/raw_sales
  - name: sales_summary
    depends_on: raw_sales
    transformations:
      - type: aggregate
        group_by: ["region", "category"]
        metrics:
          - name: total_sales
            operation: sum(amount)
```
Here, you declare the desired summary table, and Lakeflow automatically handles reading `raw_sales`, performing the aggregation, and writing `sales_summary` incrementally.
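Since the config above is conceptual, here is one plausible translation into the `dlt` Python API (a sketch; the JSON source format and the `region`, `category`, and `amount` columns are assumptions):

```python
import dlt
from pyspark.sql import functions as F

@dlt.table
def raw_sales():
    # Source: raw files landing in cloud storage (path from the config above)
    return spark.read.format("json").load("s3://data/raw_sales")

@dlt.table
def sales_summary():
    # dlt.read() both reads raw_sales and registers the dependency
    return (
        dlt.read("raw_sales")
        .groupBy("region", "category")
        .agg(F.sum("amount").alias("total_sales"))
    )
```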
Here’s a detailed breakdown of Streaming Tables, Materialized Views, and Normal Views in Databricks DLT (Delta Live Tables) and their typical usage:
## 1. Streaming Tables (formerly "streaming live tables")
Definition: A streaming table in DLT is a Delta table that is incrementally updated as new data arrives. It's typically used for real-time and near-real-time pipelines.
Key Characteristics:
| Feature | Description |
|---|---|
| Data Ingestion | Continuous or triggered, from streaming sources (Kafka, Event Hubs, cloud storage via Auto Loader). |
| Storage | Delta table on Databricks (ACID-compliant). |
| Processing Mode | Append, or upsert via `APPLY CHANGES` (a managed `MERGE`). |
| Latency | Near real-time; updates as each micro-batch is processed. |
| Schema Evolution | Supported; can evolve automatically or manually. |
Use Case:
- Real-time dashboards (e.g., sales, stock prices, payments).
- Event-driven pipelines (e.g., fraud detection, monitoring trades).
Example in DLT (Python):
```python
import dlt
from pyspark.sql.functions import current_timestamp

@dlt.table
def streaming_sales():
    # Incrementally ingest JSON files with Auto Loader (cloudFiles)
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/mnt/sales/json")
        .withColumn("processed_time", current_timestamp())
    )
```
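For the upsert path mentioned in the table above, DLT pairs `dlt.create_streaming_table` with `dlt.apply_changes` to merge change data into a streaming table. A sketch; the `sale_id` key and `processed_time` sequencing column are illustrative:

```python
import dlt

# Declare the target streaming table that the upserts land in
dlt.create_streaming_table("sales_current")

# Merge change events into the target, keeping the latest row per key
dlt.apply_changes(
    target="sales_current",
    source="streaming_sales",
    keys=["sale_id"],
    sequence_by="processed_time",
)
```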
## 2. Materialized Views
Definition: A materialized view is a Delta table that stores the precomputed results of a query. In DLT, it is a table whose content is automatically refreshed based on its dependencies.
Key Characteristics:
| Feature | Description |
|---|---|
| Data Storage | Delta table with physical storage (unlike normal views). |
| Refresh | Refreshed on pipeline updates; incremental where possible, full recompute otherwise. |
| Performance | Fast queries, since results are precomputed. |
| Freshness | Consistent with upstream tables as of the last refresh. |
Use Case:
- Aggregations (e.g., daily sales per region).
- Joins and transformations that are expensive to compute on demand.
- Serving layer for dashboards or ML pipelines.
Example in DLT (Python):
```python
import dlt
from pyspark.sql import functions as F

@dlt.table
def sales_summary():
    # dlt.read() reads streaming_sales and registers the dependency
    sales = dlt.read("streaming_sales")
    return sales.groupBy("region").agg(F.sum("amount").alias("total_sales"))
```
Here, `sales_summary` is materialized: it stores the aggregated totals physically and refreshes as new streaming data arrives.
## 3. Normal Views
Definition: A normal view in DLT is like a virtual table: it does not store data physically. Instead, it executes the underlying query each time you read it.
Key Characteristics:
| Feature | Description |
|---|---|
| Data Storage | None; only the query definition is stored. |
| Performance | Slower for large or complex queries (computed on demand). |
| Freshness | Always reflects the latest upstream data. |
| Incremental Processing | Not directly supported (the view is virtual). |
Use Case:
- Lightweight transformations that do not need storage.
- Ad-hoc analytics.
- Debugging or temporary transformations in pipelines.
Example in DLT (Python):
```python
import dlt

@dlt.view
def high_value_sales():
    # A virtual view: no storage; recomputed whenever a downstream step reads it
    sales = dlt.read("streaming_sales")
    return sales.filter(sales.amount > 10000)
```
`high_value_sales` is a normal view: it does not store any data and computes results on demand.
## 4. Quick Comparison

| Feature | Streaming Table | Materialized View | Normal View |
|---|---|---|---|
| Storage | Delta (persistent) | Delta (precomputed) | None |
| Latency | Near real-time | Refreshed on pipeline updates | Computed on demand |
| Performance | High for reads | High (precomputed) | Lower for complex queries |
| Use Case | Real-time ingestion | Aggregations, dashboards | Lightweight transformations, ad-hoc analysis |
✅ Summary / Guidance:
- Use streaming tables for continuous ingestion from sources.
- Use materialized views when you want precomputed results for fast reads and dashboards.
- Use normal views for on-the-fly filtering or temporary logical tables.