Batch vs Streaming
Delta Live Tables : Batch vs Streaming Workloads¶
Here’s a detailed breakdown of batch vs streaming workloads for Delta Live Tables (DLT) streaming table sources and when to use each:
1. Conceptual Difference¶
Aspect | Batch Workload | Streaming Workload |
---|---|---|
Data Processing | Processes a finite dataset at once. | Continuously processes new data as it arrives. |
Latency | High (runs periodically). | Low (near real-time). |
Trigger | Manual or scheduled (e.g., daily, hourly). | Continuous / micro-batches. |
State Management | Not required; table is fully recomputed or append-only. | Required; DLT tracks which data has already been processed. |
Use Cases | Historical analytics, daily aggregates, backfills. | Real-time dashboards, alerts, event processing, fraud detection. |
2. How DLT Handles Sources¶
DLT streaming table sources can come from:
-
Streaming sources
-
Kafka, Event Hubs, cloud storage (
cloudFiles
), Delta tables. - Automatically incremental, supports schema evolution.
- Example (Python):
@dlt.table
def streaming_sales():
return (
spark.readStream.format("cloudFiles")
.option("cloudFiles.format", "json")
.load("/mnt/sales/json")
)
-
Batch sources
-
Existing Delta tables, Parquet, CSV, or JDBC.
- DLT can treat these as batch reads if no streaming input is detected.
- Example (Python):
DLT automatically detects whether the source is streaming or batch, but you can explicitly control behavior with
spark.readStream
vsspark.read
.
3. Streaming Table Behavior in DLT¶
Feature | Behavior for Batch Sources | Behavior for Streaming Sources |
---|---|---|
Incremental Processing | Only new runs; full dataset read each time unless optimized. | DLT tracks offsets/checkpoints to process only new data. |
State Management | Not needed. | DLT maintains state to support upserts, aggregations, joins. |
Dependencies | Upstream changes processed on next pipeline run. | Changes propagate continuously downstream (materialized views get updated incrementally). |
Latency | Minutes to hours (depends on schedule). | Seconds to minutes (micro-batch). |
Use Case in DLT | Backfill historical data, batch pipelines. | Real-time dashboards, streaming aggregations, event-driven analytics. |
4. Choosing Between Batch and Streaming¶
Scenario | Preferred Workload |
---|---|
Daily sales report from last month | Batch |
Real-time fraud detection on transactions | Streaming |
Hourly ETL from a static CSV dump | Batch |
Continuous clickstream analytics for dashboards | Streaming |
Incremental aggregation from an append-only Delta table | Can be either, depending on freshness requirements |
5. Practical Notes for DLT¶
-
Streaming tables can consume batch data
-
You can define a streaming table that reads from a Delta table (
spark.readStream.format("delta")
), so it behaves like streaming even if the upstream table is batch. -
Materialized views downstream
-
Always incremental; DLT ensures updates propagate automatically.
-
Checkpointing
-
DLT automatically handles checkpoints for streaming sources. You don’t need to manage offsets manually.
✅ Summary:
- Batch workloads → finite, scheduled processing.
- Streaming workloads → continuous, low-latency processing with state management.
- DLT streaming tables adapt to both but are most powerful when the source is streaming.
Do you want me to make that diagram?