
Lakeflow Declarative Pipelines Introduction

Here is a high-level introduction to Lakeflow Declarative Pipelines in Databricks:


1. What is Lakeflow?

Lakeflow Declarative Pipelines (the evolution of Delta Live Tables, DLT) is a declarative pipeline framework from Databricks designed to simplify building and managing data pipelines on Delta Lake. Unlike traditional ETL pipelines, where you imperatively write every transformation step, you declare the desired end state of your data and let Databricks handle execution, orchestration, and dependency management.

Think of it like “telling Databricks what you want, not how to do it”.


2. Key Concepts

| Concept | Description |
| --- | --- |
| Pipeline | A logical collection of transformations (like ETL jobs) defined declaratively. |
| Lakeflow Table | A Delta table managed by Lakeflow that tracks lineage, schema, and dependencies. |
| Declarative Config | A JSON/YAML-like specification describing sources, transformations, and targets. |
| State Management | Lakeflow keeps track of which data has been processed and ensures idempotent updates. |
| Incremental Processing | Automatically detects new/changed data and applies transformations incrementally. |

3. How It Differs from Normal Pipelines

| Feature | Traditional ETL | Lakeflow Declarative Pipeline |
| --- | --- | --- |
| Definition | Imperative: step-by-step code | Declarative: specify inputs, outputs, transformations |
| Orchestration | Manual or Airflow/scheduler | Built-in, dependency-aware orchestration |
| Data Lineage | Requires extra tooling | Automatic tracking of lineage between tables |
| Error Handling | Manual retries | Automatic state management and retries |
| Incremental Loads | Developer writes the logic | Detects changes and processes incrementally |

4. Basic Pipeline Flow

  1. Define sources: Raw Delta tables, cloud storage, or external systems.
  2. Declare transformations: For example, aggregations, joins, or enrichments.
  3. Specify targets: Delta tables managed by Lakeflow.
  4. Run pipeline: Databricks ensures only necessary transformations run, handles dependencies, and maintains consistency.

5. Advantages

  • Less boilerplate code → focus on business logic, not orchestration.
  • Automatic incremental updates → avoids reprocessing all data.
  • Built-in lineage and auditing → helps with compliance and debugging.
  • Easier pipeline management → declarative config files version-controlled like code.

6. Example (Conceptual)

# Sample declarative pipeline config
pipeline_name: sales_pipeline
tables:
  - name: raw_sales
    source: s3://data/raw_sales
  - name: sales_summary
    depends_on: raw_sales
    transformations:
      - type: aggregate
        group_by: ["region", "category"]
        metrics:
          - name: total_sales
            operation: sum(amount)

Here, you declare the desired summary table, and Lakeflow automatically handles reading raw_sales, performing aggregation, and writing sales_summary incrementally.
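
For orientation, roughly the same pipeline can be sketched with the Python decorator API used in the examples below. This is a minimal, illustrative sketch: the S3 path and column names are taken from the conceptual config above, and reading the raw data as Delta is an assumption.

import dlt
from pyspark.sql.functions import sum as spark_sum

# Source: the raw sales data declared in the config (assumed to be stored as
# Delta at the placeholder S3 path).
@dlt.table
def raw_sales():
    return spark.read.format("delta").load("s3://data/raw_sales")

# Target: the dependency on raw_sales is inferred from the dlt.read() call,
# so the pipeline runs the two steps in the correct order.
@dlt.table
def sales_summary():
    return (
        dlt.read("raw_sales")
           .groupBy("region", "category")
           .agg(spark_sum("amount").alias("total_sales"))
    )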

Here’s a detailed breakdown of Streaming Tables, Materialized Views, and Normal Views in Databricks DLT (Delta Live Tables) and their typical usage:


1. Streaming Tables (formerly "streaming live tables")

Definition: A streaming table in DLT is a Delta table that is continuously updated as new data arrives. It’s typically used for real-time pipelines.

Key Characteristics:

| Feature | Description |
| --- | --- |
| Data Ingestion | Continuous, from streaming sources (Kafka, Event Hubs, cloud storage). |
| Storage | Delta table on Databricks (supports ACID). |
| Processing Mode | Append or upsert (via MERGE). |
| Latency | Near real-time; updates as soon as micro-batches are processed. |
| Schema Evolution | Supported; can evolve automatically or manually. |

Use Case:

  • Real-time dashboards (e.g., sales, stock prices, payments).
  • Event-driven pipelines (e.g., fraud detection, monitoring trades).

Example in DLT (Python):

import dlt
from pyspark.sql.functions import current_timestamp

# Streaming table: continuously ingests new JSON files via Auto Loader (cloudFiles)
# and appends them to the target Delta table.
@dlt.table
def streaming_sales():
    return (
        spark.readStream.format("cloudFiles")
             .option("cloudFiles.format", "json")
             .load("/mnt/sales/json")
             .withColumn("processed_time", current_timestamp())
    )
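
The characteristics table above also lists message buses such as Kafka as streaming sources. Below is a minimal sketch of a streaming table that reads from Kafka; the broker address, topic name, and JSON value schema are assumptions for illustration.

import dlt
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Assumed schema of the JSON payload carried in the Kafka message value.
sale_schema = StructType([
    StructField("region", StringType()),
    StructField("category", StringType()),
    StructField("amount", DoubleType()),
])

@dlt.table
def streaming_sales_kafka():
    return (
        spark.readStream.format("kafka")
             .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
             .option("subscribe", "sales-events")                # placeholder topic
             .load()
             # Kafka delivers the payload as bytes; cast to string and parse the JSON.
             .select(from_json(col("value").cast("string"), sale_schema).alias("sale"))
             .select("sale.*")
    )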

2. Materialized Views

Definition: A materialized view is a Delta table that stores the precomputed results of a query. In DLT, it is a table whose content is automatically refreshed based on its dependencies.

Key Characteristics:

| Feature | Description |
| --- | --- |
| Data Storage | Delta table with physical storage (unlike normal views). |
| Refresh | Refreshed when the pipeline updates; incremental where possible, otherwise recomputed. |
| Performance | Faster queries, since results are precomputed. |
| Freshness | Reflects the underlying tables as of the most recent refresh. |

Use Case:

  • Aggregations (e.g., daily sales per region).
  • Joins and transformations that are expensive to compute on demand.
  • Serving layer for dashboards or ML pipelines.

Example in DLT (Python):

import dlt
from pyspark.sql.functions import sum as spark_sum  # avoid shadowing the built-in sum

# Materialized view: the aggregated result is stored physically and refreshed
# by the pipeline as upstream data changes.
@dlt.table
def sales_summary():
    sales = dlt.read("streaming_sales")  # dependency on the streaming table is inferred
    return sales.groupBy("region").agg(spark_sum("amount").alias("total_sales"))

Here sales_summary is materialized: it stores the aggregated totals physically and refreshes as new streaming data arrives.
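
Because the result is stored physically, consumers outside the pipeline can read the published table directly, for example from a notebook or a job. The catalog and schema names below are hypothetical placeholders.

# Read the materialized result from outside the pipeline (placeholder names).
summary = spark.read.table("main.sales.sales_summary")
summary.show()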


3. Normal Views

Definition: A normal view in DLT is like a virtual table: it does not store data physically. Instead, it executes the underlying query each time you read it.

Key Characteristics:

| Feature | Description |
| --- | --- |
| Data Storage | None; only the query definition is stored. |
| Performance | Slower for large or complex queries (computed on demand). |
| Freshness | Always reflects the latest upstream data when queried. |
| Incremental Processing | Not directly supported (the view is virtual). |

Use Case:

  • Lightweight transformations that do not need storage.
  • Ad-hoc analytics.
  • Debugging or temporary transformations in pipelines.

Example in DLT (Python):

import dlt

# Normal (temporary) view: no data is stored; the filter is re-executed each
# time the view is referenced within the pipeline.
@dlt.view
def high_value_sales():
    sales = dlt.read("streaming_sales")
    return sales.filter(sales.amount > 10000)

high_value_sales is a normal view: it does not store any data and computes results on demand.
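
Views like this are often used as intermediate steps that other datasets in the pipeline build on; because nothing is persisted, the view's query is re-executed as part of computing each consumer. The downstream table below is a hypothetical example (its name and the region column are assumed for illustration).

import dlt
from pyspark.sql.functions import count

# Hypothetical downstream table that consumes the view; the view's filter is
# recomputed as part of producing this table.
@dlt.table
def high_value_sales_by_region():
    return (
        dlt.read("high_value_sales")
           .groupBy("region")
           .agg(count("*").alias("num_high_value_sales"))
    )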


4. Quick Comparison

| Feature | Streaming Table | Materialized View | Normal View |
| --- | --- | --- | --- |
| Storage | Delta (persistent) | Delta (precomputed) | None |
| Latency | Near real-time | Incrementally refreshed | On demand |
| Performance | High for reads | High (precomputed) | Lower for complex queries |
| Use Case | Real-time ingestion | Aggregations, dashboards | Lightweight transformations, ad-hoc analysis |

Summary / Guidance:

  • Use streaming tables for continuous ingestion from sources.
  • Use materialized views when you want precomputed results for fast reads and dashboards.
  • Use normal views for on-the-fly filtering or temporary logical tables.