Saltar a contenido

Liquid Clustering

Liquid Clustering in Delta Lake and Databricks

🔹 Traditional Partitioning (the old way)

When you create a Delta table, you pick a partition column (e.g., date).

Data is physically stored in folders like:

/table/date=2025-08-25/
/table/date=2025-08-26/

Queries on date are very fast (partition pruning).

But… problems:

You must choose the partition column upfront (hard to change later). Skew → some partitions get huge, others tiny. If you query on a different column (say country), partitioning doesn’t help.

🔹 What is Liquid Clustering?

Liquid Clustering is next-gen partitioning without rigid partitions.

Instead of fixed folder partitions, Delta uses clustering columns.

Data is automatically organized into files that are co-located based on clustering keys.

No fixed directories — clustering boundaries are “liquid,” meaning they can shift over time.

Think of it like:

Partitioning = chopping the cake into fixed slices 🍰. Liquid Clustering = marbling the cake so flavors are naturally grouped but flexible 🌀.

image

🧩 The root problem: concurrent writes

In a traditional partitioned Delta table:

If two jobs write to the same partition folder (say date=2025-08-25), they may overwrite each other’s files, create tons of small files, or cause conflicts.

Delta’s transaction log ( _delta_log ) prevents corruption, but still you can get:

  • Write conflicts
  • Compaction/reorg problems
  • Skewed partitions

🌀 What Liquid Clustering does differently

Liquid Clustering removes the dependency on static partition folders.

There is no date=2025-08-25/ folder.

Instead, data is stored in files spread across the table storage, tagged internally with clustering metadata.

When multiple jobs write:

  • The Delta transaction log coordinates atomic commits.
  • Writers don’t fight for the same fixed folder (no "hotspot").
  • Databricks automatically distributes new rows into the right clustering ranges.

⚡ How concurrent writes are prevented

Transaction log serialization

Every write creates a new JSON transaction in _delta_log. If two jobs conflict, Delta retries or errors out gracefully — no corruption.

No rigid partitions

Since clustering is "liquid", two writers can both insert data with the same date or country values. Delta decides file placement dynamically (not tied to a single folder).

Background clustering

Databricks runs auto-optimization jobs to maintain clustering quality. Even if concurrent writes scatter data, the optimizer later reorganizes files.

Reduced small files problem

With partitions, concurrent writers often create many tiny files in the same folder. With Liquid Clustering, writers spread load across cluster ranges → fewer hotspots.

🍕 Pizza Shop Analogy

Imagine you and your friends are delivering pizzas to an office building.

Old Way (Partitions)

The building has one mailbox per floor. If two delivery guys (writers) come to the same floor mailbox at the same time, they fight for space. The mailbox gets messy, pizzas overlap, and sometimes one delivery overwrites the other. This is like partitioned Delta tables → if two jobs write to the same partition folder, conflicts happen.

New Way (Liquid Clustering)

Now the building switches to smart lockers (clustering ranges).

When a delivery comes in, the system automatically assigns any free locker on that floor.

Two delivery guys can deliver pizzas for the same floor at the same time, but the system spreads them across different lockers.

Later, the building staff reorganizes lockers (background clustering) so pizzas for the same person are grouped together neatly.

This is like Liquid Clustering → no fixed folders, data is dynamically placed, and reorganized in the background.

The Delta log is like the building’s register that records every pizza delivered → so no one loses track.

Are there Trade Offs?

With partitions (mailboxes), if you know the floor (partition key), you go directly to that mailbox — super fast 🚀 for point lookups.

With liquid clustering (smart lockers), pizzas for the same floor (or customer) might be spread across multiple lockers. To find all pizzas for “floor 5,” you may have to open several lockers instead of one → sounds slower, right?

Why it’s not actually that slow in practice:

Clustering index in metadata

Delta keeps track of where rows are stored (think: a digital map of which lockers hold floor 5 pizzas). Readers don’t randomly scan every file; they check the index and skip irrelevant files.

File skipping + statistics

Each data file stores min/max values of the clustering column. So if you query “customer_id = 123,” Delta can skip 90% of files if their min/max range doesn’t cover 123.

Background reclustering

Liquid clustering reorganizes lockers in the background, so “similar pizzas” get grouped closer over time. This means queries get faster the more the system reclusters.

Trade-off (balanced)

Old partitions → fast for single key lookups, but slow for big aggregations (because partitions may be uneven/skewed). Liquid clustering → slightly slower for tiny point lookups, but much faster and balanced for mixed workloads (point lookups + large scans).