

Delta Live Tables: Batch vs Streaming Workloads

Here’s a detailed breakdown of batch vs streaming workloads as sources for Delta Live Tables (DLT) streaming tables, and when to use each:


1. Conceptual Difference

Aspect | Batch Workload | Streaming Workload
Data Processing | Processes a finite dataset at once. | Continuously processes new data as it arrives.
Latency | High (runs periodically). | Low (near real-time).
Trigger | Manual or scheduled (e.g., daily, hourly). | Continuous / micro-batches.
State Management | Not required; table is fully recomputed or append-only. | Required; DLT tracks which data has already been processed.
Use Cases | Historical analytics, daily aggregates, backfills. | Real-time dashboards, alerts, event processing, fraud detection.
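
To make these differences concrete, here is a minimal, non-DLT Structured Streaming sketch of the same Delta source handled both ways. It assumes an active SparkSession named spark (as in a Databricks notebook); the output and checkpoint paths are illustrative, not part of the original pipeline.

# Batch: read the full dataset once, typically on a schedule (e.g., a nightly job).
batch_df = spark.read.format("delta").load("/mnt/sales/delta")
batch_df.write.format("delta").mode("overwrite").save("/mnt/sales/daily_summary")  # illustrative output path

# Streaming: pick up only new data as it arrives, in micro-batches.
stream_df = spark.readStream.format("delta").load("/mnt/sales/delta")
(stream_df.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/sales/_checkpoints/summary")  # progress is tracked here
    .trigger(processingTime="1 minute")                               # micro-batch every minute
    .start("/mnt/sales/streaming_summary"))                           # illustrative output path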

2. How DLT Handles Sources

DLT streaming table sources can come from:

  1. Streaming sources
     • Kafka, Event Hubs, cloud storage (cloudFiles), Delta tables.
     • Automatically incremental; supports schema evolution.
     • Example (Python):

import dlt

# Auto Loader (cloudFiles) incrementally ingests new JSON files as they land.
@dlt.table
def streaming_sales():
    return (
        spark.readStream.format("cloudFiles")
             .option("cloudFiles.format", "json")
             .load("/mnt/sales/json")
    )

  2. Batch sources
     • Existing Delta tables, Parquet, CSV, or JDBC.
     • DLT can treat these as batch reads if no streaming input is detected.
     • Example (Python):

# Full read of an existing Delta table on each pipeline update.
@dlt.table
def daily_sales():
    return spark.read.format("delta").load("/mnt/sales/delta")

DLT automatically detects whether a source is streaming or batch, and you control that behavior explicitly by choosing spark.readStream (streaming) or spark.read (batch).
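
As a sketch of that explicit control, the same Delta path can back either kind of table depending on which read API you use. The table names below are illustrative, not part of the original pipeline.

import dlt

# Batch: the full dataset is re-read on each pipeline update.
@dlt.table
def sales_snapshot():
    return spark.read.format("delta").load("/mnt/sales/delta")

# Streaming: only rows appended since the last update are processed.
@dlt.table
def sales_incremental():
    return spark.readStream.format("delta").load("/mnt/sales/delta")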


3. Streaming Table Behavior in DLT

Feature | Behavior for Batch Sources | Behavior for Streaming Sources
Incremental Processing | Processed only on new runs; full dataset read each time unless optimized. | DLT tracks offsets/checkpoints to process only new data.
State Management | Not needed. | DLT maintains state to support upserts, aggregations, joins.
Dependencies | Upstream changes processed on next pipeline run. | Changes propagate continuously downstream (materialized views are updated incrementally).
Latency | Minutes to hours (depends on schedule). | Seconds to minutes (micro-batch).
Use Case in DLT | Backfill historical data, batch pipelines. | Real-time dashboards, streaming aggregations, event-driven analytics.
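
As an example of that state management, a downstream streaming aggregation only folds new micro-batches into its running state. Here is a minimal sketch building on the streaming_sales table defined above; the event_time, region, and amount columns and the window sizes are assumed for illustration.

import dlt
from pyspark.sql import functions as F

@dlt.table
def sales_by_region_5min():
    # The watermark bounds how much state is kept; the windowed aggregation
    # is updated incrementally as new micro-batches arrive.
    return (
        dlt.read_stream("streaming_sales")
           .withWatermark("event_time", "10 minutes")
           .groupBy(F.window("event_time", "5 minutes"), "region")
           .agg(F.sum("amount").alias("total_amount"))
    )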

4. Choosing Between Batch and Streaming

Scenario | Preferred Workload
Daily sales report from last month | Batch
Real-time fraud detection on transactions | Streaming
Hourly ETL from a static CSV dump | Batch
Continuous clickstream analytics for dashboards | Streaming
Incremental aggregation from an append-only Delta table | Can be either, depending on freshness requirements

5. Practical Notes for DLT

  1. Streaming tables can consume batch data
     • You can define a streaming table that reads from a Delta table (spark.readStream.format("delta")), so it behaves like streaming even if the upstream table is batch (see the sketch after this list).

  2. Materialized views downstream
     • Always incremental; DLT ensures updates propagate automatically.

  3. Checkpointing
     • DLT automatically handles checkpoints for streaming sources. You don’t need to manage offsets manually.
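
A minimal sketch of point 1: a streaming table over the Delta location used in the earlier batch example. Even though the upstream data is written in batch, this table only processes rows appended since the last pipeline update, and DLT manages the checkpoint for it.

import dlt

@dlt.table
def sales_events_stream():
    # Incremental read of an existing Delta table; offsets/checkpoints are handled by DLT.
    return spark.readStream.format("delta").load("/mnt/sales/delta")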


Summary:

  • Batch workloads → finite, scheduled processing.
  • Streaming workloads → continuous, low-latency processing with state management.
  • DLT streaming tables adapt to both but are most powerful when the source is streaming.
