

Delta Live Tables: Batch vs Streaming Workloads

Here’s a detailed breakdown of batch vs streaming workloads as sources for Delta Live Tables (DLT) streaming tables, and when to use each:


1. Conceptual Difference

Aspect | Batch Workload | Streaming Workload
Data Processing | Processes a finite dataset at once. | Continuously processes new data as it arrives.
Latency | High (runs periodically). | Low (near real-time).
Trigger | Manual or scheduled (e.g., daily, hourly). | Continuous / micro-batches.
State Management | Not required; table is fully recomputed or append-only. | Required; DLT tracks which data has already been processed.
Use Cases | Historical analytics, daily aggregates, backfills. | Real-time dashboards, alerts, event processing, fraud detection.
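
To make these differences concrete, here is a minimal, non-DLT Structured Streaming sketch of the same Delta source handled both ways. It assumes an active SparkSession named spark (as in a Databricks notebook); the output and checkpoint paths are illustrative, not part of the original pipeline.

# Batch: read the full dataset once, typically on a schedule (e.g., a nightly job).
batch_df = spark.read.format("delta").load("/mnt/sales/delta")
batch_df.write.format("delta").mode("overwrite").save("/mnt/sales/daily_summary")  # illustrative output path

# Streaming: pick up only new data as it arrives, in micro-batches.
stream_df = spark.readStream.format("delta").load("/mnt/sales/delta")
(stream_df.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/sales/_checkpoints/summary")  # progress is tracked here
    .trigger(processingTime="1 minute")                               # micro-batch every minute
    .start("/mnt/sales/streaming_summary"))                           # illustrative output path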

2. How DLT Handles Sources

DLT streaming table sources can come from:

  1. Streaming sources
     • Kafka, Event Hubs, cloud storage (cloudFiles), Delta tables.
     • Automatically incremental; supports schema evolution.
     • Example (Python):

import dlt

# Auto Loader (cloudFiles) incrementally ingests new JSON files as they land.
@dlt.table
def streaming_sales():
    return (
        spark.readStream.format("cloudFiles")
             .option("cloudFiles.format", "json")
             .load("/mnt/sales/json")
    )

  2. Batch sources
     • Existing Delta tables, Parquet, CSV, or JDBC.
     • DLT can treat these as batch reads if no streaming input is detected.
     • Example (Python):

# Full read of an existing Delta table on each pipeline update.
@dlt.table
def daily_sales():
    return spark.read.format("delta").load("/mnt/sales/delta")

DLT automatically detects whether a source is streaming or batch, and you control that behavior explicitly by choosing spark.readStream (streaming) or spark.read (batch).
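
As a sketch of that explicit control, the same Delta path can back either kind of table depending on which read API you use. The table names below are illustrative, not part of the original pipeline.

import dlt

# Batch: the full dataset is re-read on each pipeline update.
@dlt.table
def sales_snapshot():
    return spark.read.format("delta").load("/mnt/sales/delta")

# Streaming: only rows appended since the last update are processed.
@dlt.table
def sales_incremental():
    return spark.readStream.format("delta").load("/mnt/sales/delta")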


3. Streaming Table Behavior in DLT

Feature | Behavior for Batch Sources | Behavior for Streaming Sources
Incremental Processing | Processed only on new runs; full dataset read each time unless optimized. | DLT tracks offsets/checkpoints to process only new data.
State Management | Not needed. | DLT maintains state to support upserts, aggregations, joins.
Dependencies | Upstream changes processed on next pipeline run. | Changes propagate continuously downstream (materialized views are updated incrementally).
Latency | Minutes to hours (depends on schedule). | Seconds to minutes (micro-batch).
Use Case in DLT | Backfill historical data, batch pipelines. | Real-time dashboards, streaming aggregations, event-driven analytics.
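
As an example of that state management, a downstream streaming aggregation only folds new micro-batches into its running state. Here is a minimal sketch building on the streaming_sales table defined above; the event_time, region, and amount columns and the window sizes are assumed for illustration.

import dlt
from pyspark.sql import functions as F

@dlt.table
def sales_by_region_5min():
    # The watermark bounds how much state is kept; the windowed aggregation
    # is updated incrementally as new micro-batches arrive.
    return (
        dlt.read_stream("streaming_sales")
           .withWatermark("event_time", "10 minutes")
           .groupBy(F.window("event_time", "5 minutes"), "region")
           .agg(F.sum("amount").alias("total_amount"))
    )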

4. Choosing Between Batch and Streaming

Scenario | Preferred Workload
Daily sales report from last month | Batch
Real-time fraud detection on transactions | Streaming
Hourly ETL from a static CSV dump | Batch
Continuous clickstream analytics for dashboards | Streaming
Incremental aggregation from an append-only Delta table | Can be either, depending on freshness requirements

5. Practical Notes for DLT

  1. Streaming tables can consume batch data
     • You can define a streaming table that reads from a Delta table (spark.readStream.format("delta")), so it behaves like streaming even if the upstream table is batch (see the sketch after this list).

  2. Materialized views downstream
     • Always incremental; DLT ensures updates propagate automatically.

  3. Checkpointing
     • DLT automatically handles checkpoints for streaming sources. You don’t need to manage offsets manually.
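
A minimal sketch of point 1: a streaming table over the Delta location used in the earlier batch example. Even though the upstream data is written in batch, this table only processes rows appended since the last pipeline update, and DLT manages the checkpoint for it.

import dlt

@dlt.table
def sales_events_stream():
    # Incremental read of an existing Delta table; offsets/checkpoints are handled by DLT.
    return spark.readStream.format("delta").load("/mnt/sales/delta")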


Summary:

  • Batch workloads → finite, scheduled processing.
  • Streaming workloads → continuous, low-latency processing with state management.
  • DLT streaming tables adapt to both but are most powerful when the source is streaming.
