Saltar a contenido

Storage

Data Storage Internals, Checkpoints and Renaming Streaming Tables and Views

Here’s a detailed explanation of how Delta Live Tables (DLT) handles storage for streaming tables, pipeline dependency, and renaming behavior:


1. Where is data for streaming tables stored?

  • Streaming tables in DLT are stored as Delta tables on the Databricks File System (DBFS) or your cloud storage configured for the pipeline, e.g., S3, ADLS Gen2, or GCS.
  • Every streaming table has a physical Delta table location, even though you define it declaratively in DLT.
  • The storage location is typically managed by the DLT pipeline, but you can explicitly configure it in advanced pipeline settings.

Key points:

  • Incremental state (processed offsets, checkpoints) is stored in _system-managed checkpoints within the pipeline’s storage path.
  • Upserts and merges are persisted in the Delta table itself.
  • Data retention and compaction follow normal Delta table rules.

2. Is storage dependent on the pipeline?

Yes, partially:

  • Pipeline-specific storage: Each DLT pipeline manages its own metadata and checkpoints for the streaming tables it owns.
  • Shared tables: If multiple pipelines reference the same Delta table (e.g., using LIVE.<table_name>), the physical Delta table is shared, but each pipeline maintains its own lineage and state metadata.

Implication:

  • Deleting a pipeline does not delete the underlying Delta table automatically, unless you explicitly choose managed tables.
  • Changing pipelines (like moving a table to a different pipeline) requires careful handling to avoid breaking downstream dependencies.

3. What happens when we rename streaming tables?

  • DLT does not support a “rename” operation in place for streaming tables.
  • If you rename a table in DLT:

  • The new table name points to a new managed object in the pipeline.

  • The underlying Delta data is copied or remapped depending on configuration.
  • Any downstream references (LIVE.<old_name>) break unless you update them to the new name.
  • Best practice:

  • Avoid renaming streaming tables in active pipelines.

  • If renaming is needed, create a new table with the desired name and point downstream materialized views or pipelines to it.

4. Practical Notes / Recommendations

Aspect Recommendation
Storage Let DLT manage default Delta table locations unless you need a custom path.
Pipeline dependency Be aware that streaming tables are tied to pipeline metadata (checkpoints, lineage).
Renaming Prefer creating a new table and updating downstream references; avoid in-place renames for live pipelines.
Backup If renaming or moving, snapshot or backup Delta tables to avoid data loss.

Summary:

  1. Streaming tables always persist data as Delta tables in the pipeline’s storage.
  2. Storage and checkpoints are pipeline-dependent, but the data itself can be shared.
  3. Renaming a streaming table breaks dependencies; best approach is to create a new table instead of renaming.