Deletion Vectors in Delta Lake

🔹 What are Deletion Vectors?

Normally, when you delete rows in a Delta table, Delta rewrites entire Parquet files without those rows.

This is called copy-on-write → expensive for big tables.

Deletion Vectors (DVs) are a new optimization:

Instead of rewriting files, Delta just marks the deleted rows with a bitmap (a lightweight “mask”). The data is still physically there, but readers skip the “deleted” rows.

Think of it like putting a red X mark ❌ on rows instead of erasing them immediately.

🔹 Why are they useful?

🚀 Much faster deletes/updates/merges (because files aren’t rewritten).

⚡ Less I/O → good for big data tables.

✅ Efficient for streaming + time travel.

Example without deletion vectors

  1. Create a sales table
-- Load the sample CSV into a new Delta table
CREATE TABLE dev.bronze.sales AS
SELECT *
FROM read_files(
  'dbfs:/databricks-datasets/online_retail/data-001/data.csv',
  header => true,
  format => 'csv'
);
  2. Disable deletion vectors on the table
ALTER TABLE dev.bronze.sales SET TBLPROPERTIES (delta.enableDeletionVectors = false);

  3. Delete some rows
-- Delete every row for invoice 540644
DELETE FROM dev.bronze.sales
WHERE InvoiceNo = '540644';
  4. Describe history

Observe in the history output that the whole data file was removed and rewritten: all 65,000+ rows were copied to a new file just to drop the few deleted ones.
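
The two checks in the steps above (the table property and the table history) can be run roughly as follows. This is a minimal sketch assuming the same dev.bronze.sales table; the operationMetrics field names mentioned in the comments are the ones Delta typically reports for a DELETE and may differ by runtime version.

```sql
-- Confirm the property is set as expected
SHOW TBLPROPERTIES dev.bronze.sales ('delta.enableDeletionVectors');

-- List the table's commits, newest first
DESCRIBE HISTORY dev.bronze.sales;
-- In the DELETE row, inspect operationMetrics. With deletion vectors off,
-- expect removed and added files (numRemovedFiles, numAddedFiles) and a
-- large numCopiedRows, since untouched rows are copied into the new file.
```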

Example with deletion vectors

Repeat the delete with delta.enableDeletionVectors set to true, then check the table history again.

We can see that one deletion vector is added and no files are rewritten.
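
A minimal sketch of this second run, assuming the same table as in the first example. The invoice number '536365' is only a placeholder (540644 has already been deleted); substitute any InvoiceNo that still exists in your table. Keep in mind that enabling deletion vectors upgrades the table protocol, so older clients may no longer be able to read the table.

```sql
-- Turn deletion vectors on for the table
ALTER TABLE dev.bronze.sales SET TBLPROPERTIES ('delta.enableDeletionVectors' = true);

-- Delete a different invoice ('536365' is a placeholder value)
DELETE FROM dev.bronze.sales
WHERE InvoiceNo = '536365';

-- Inspect the newest DELETE commit. In operationMetrics, look for
-- numDeletionVectorsAdded and check that no files were added or removed.
DESCRIBE HISTORY dev.bronze.sales;
```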

Running OPTIMIZE rewrites the affected files, which physically removes the soft-deleted records.
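
A hedged sketch of how the soft-deleted rows could be purged, again assuming the dev.bronze.sales table. OPTIMIZE may not touch every file that carries a deletion vector, so REORG TABLE ... APPLY (PURGE) is shown as a more targeted alternative; whether you need it depends on your workload.

```sql
-- Compact files; rewritten files no longer contain the soft-deleted rows
OPTIMIZE dev.bronze.sales;

-- Alternatively, rewrite the files that contain soft-deleted data
REORG TABLE dev.bronze.sales APPLY (PURGE);

-- The old files stay on storage (for time travel) until VACUUM removes
-- them after the retention period
VACUUM dev.bronze.sales;
```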