Azure Data Lake Gen2 Overview

🔹 What is ADLS Gen2?

Azure Data Lake Storage Gen2 is Microsoft’s enterprise-grade, big data storage service built on top of Azure Blob Storage.
It combines the scalability, durability, and low cost of Blob Storage with a hierarchical namespace (folders & files like a traditional file system).
It’s designed for analytics and big data workloads (Spark, Databricks, Synapse, HDInsight, etc.), while still being general-purpose storage.

Hierarchical Namespace (HNS)
Unlike flat Blob Storage, ADLS Gen2 organizes data into directories and subdirectories.
Enables atomic file operations like rename and move at the directory/file level.
Reduces cost and complexity of working with files in analytics.
Unified Storage
Built on Blob Storage → same account, same data redundancy options, same durability.
No need to maintain separate “data lake” and “blob” accounts.
Optimized for Big Data Analytics
Exposes an HDFS-compatible interface through the ABFS driver (abfs:// and abfss:// URIs), so Apache Hadoop tools can read and write it directly.
Seamless integration with Azure Databricks, Synapse Analytics, HDInsight, Azure Data Factory.
Security
Supports Azure RBAC (Role-Based Access Control) and POSIX-like ACLs (Access Control Lists).
Fine-grained permissions down to folder/file level.
Integrated with Microsoft Entra ID (formerly Azure Active Directory, AAD) for authentication.
Cost-Effective
Pay-as-you-go pricing (like Blob).
Storage tiers (Hot, Cool, Archive) available.
Hierarchical namespace reduces overhead for analytics jobs (cheaper file operations).
Scalability & Performance
Handles petabytes to exabytes of data.
Optimized throughput for parallel analytics jobs.
Works with serverless and distributed compute engines.

Data Lakes: Centralized storage for structured + semi-structured + unstructured data.
Analytics: Source for Spark, Synapse, Databricks, HDInsight.
Machine Learning: Storing training datasets and ML feature stores.
ETL Pipelines: Staging raw → curated → consumable zones.
Archival Storage: Retain large volumes of log/event data at low cost.

Feature	Blob Storage	ADLS Gen2
Namespace	Flat	Hierarchical
File operations	Expensive (copy + delete)	Atomic (rename/move)
Security	Azure RBAC (no POSIX ACLs)	RBAC + POSIX ACLs
Analytics integration	Limited	Optimized for big data
APIs	Blob REST APIs	Blob APIs + HDFS-compatible APIs

A typical ADLS Gen2 data lake is organized into layers, such as the raw, curated, and consumable zones used by ETL pipelines.
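One way to picture such a layered layout is a small path builder. This is only a sketch: the zone names follow the raw → curated → consumable staging mentioned earlier, while the `zone_path` helper, the dataset/date folder convention, and the `storageaccount` / `datalake` names are assumptions for illustration.

```python
from datetime import date

# Zone names taken from the raw -> curated -> consumable staging described above.
ZONES = ("raw", "curated", "consumable")


def zone_path(zone: str, dataset: str, day: date,
              account: str = "storageaccount",
              container: str = "datalake") -> str:
    """Build an abfss:// directory path for one dataset in one zone.

    The <zone>/<dataset>/<yyyy>/<mm>/<dd>/ convention is hypothetical;
    real lakes often use a different partitioning scheme.
    """
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone!r}")
    return (
        f"abfss://{container}@{account}.dfs.core.windows.net/"
        f"{zone}/{dataset}/{day:%Y/%m/%d}/"
    )


print(zone_path("raw", "transactions", date(2025, 1, 15)))
```

Keeping the zone as the first path segment means permissions and lifecycle rules can be applied per layer rather than per dataset.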

The Hierarchical Namespace (HNS) is the defining feature of ADLS Gen2, so it is worth a closer look.

Normally, Blob Storage is a flat namespace:
Every object (blob) lives in a single flat container.
The “folders” you see in the Azure portal are just virtual prefixes in blob names (sales/2025/january/data.csv is just a string, not a real folder).
Operations like rename or move are simulated (copy + delete), which is slow and costly.
In ADLS Gen2, the Hierarchical Namespace (HNS) adds:
True directories and subdirectories (like an actual file system).
Objects are tracked as files within directories, not just as strings.
File system operations (rename, move, delete directory, list directory) become atomic and efficient.

Rename/Move: In Blob storage → requires copy + delete (slow, doubles cost). In ADLS Gen2 → instant metadata update (atomic, cheap).
Delete Directory: In Blob storage → must delete each file one by one. In ADLS Gen2 → single operation at directory level.
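The cost difference can be illustrated with a toy in-memory model. This is not how Azure stores data internally; the `rename_prefix_flat` and `rename_dir_hierarchical` functions and the `Directory` class are hypothetical, and the model only counts operations to show why a flat namespace scales with the number of files while a hierarchical one does not.

```python
from dataclasses import dataclass, field


def rename_prefix_flat(blobs: dict, old: str, new: str) -> int:
    """Rename a virtual folder in a flat namespace.

    Every blob whose name starts with `old` is copied to a new name and
    the original deleted. Returns the number of operations performed.
    """
    ops = 0
    for name in [n for n in blobs if n.startswith(old)]:
        blobs[new + name[len(old):]] = blobs[name]  # copy
        del blobs[name]                             # delete
        ops += 2
    return ops


@dataclass
class Directory:
    """A directory node: child name -> Directory or file contents."""
    children: dict = field(default_factory=dict)


def rename_dir_hierarchical(parent: Directory, old: str, new: str) -> int:
    """Rename a real directory: a single metadata update on the parent."""
    parent.children[new] = parent.children.pop(old)
    return 1


# 1,000 files under sales/2025/ in each model.
flat = {f"sales/2025/{i}.csv": b"x" for i in range(1000)}
tree = Directory({"2025": Directory({f"{i}.csv": b"x" for i in range(1000)})})

print(rename_prefix_flat(flat, "sales/2025/", "sales/archive/"))  # 2 ops per file
print(rename_dir_hierarchical(tree, "2025", "archive"))           # 1 op in total
```

The flat version also passes through intermediate states where only some blobs have moved, which is the half-renamed risk that a single namespace-level update avoids.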

Supports POSIX-like ACLs (Access Control Lists) at folder/file level. Example:
/raw/sales → only raw-data team has read/write.
/curated/finance → finance team has read-only.
Much finer granularity than just account/container level RBAC.
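As a rough sketch, the two example policies could be written as POSIX-style ACL strings. The `acl_entry` and `build_acl` helpers and the `<...-object-id>` placeholders are hypothetical; the resulting string uses the comma-separated `type:id:permissions` form that, as far as I know, the Data Lake SDKs accept when setting access control on a directory. Treat this as an illustration rather than a complete permission model (it does not handle masks or default ACLs).

```python
def acl_entry(kind: str, perms: str, principal_id: str = "") -> str:
    """Format one ACL entry as 'type:id:permissions'.

    An empty principal_id refers to the owning user / owning group.
    """
    if kind not in ("user", "group", "mask", "other"):
        raise ValueError(f"unknown ACL entry type: {kind!r}")
    # Each position must be its letter (r, w, x) or '-'.
    if len(perms) != 3 or any(p not in (c, "-") for p, c in zip(perms, "rwx")):
        raise ValueError(f"invalid permission string: {perms!r}")
    return f"{kind}:{principal_id}:{perms}"


def build_acl(*entries: str) -> str:
    """Join entries into a single comma-separated ACL string."""
    return ",".join(entries)


# Placeholder identifiers; real entries use directory object IDs.
RAW_TEAM = "<raw-data-team-object-id>"
FINANCE_TEAM = "<finance-team-object-id>"

# /raw/sales: raw-data team gets read/write (x is needed to traverse a directory).
raw_sales_acl = build_acl(
    acl_entry("user", "rwx"),
    acl_entry("group", "r-x"),
    acl_entry("group", "rwx", RAW_TEAM),
    acl_entry("other", "---"),
)

# /curated/finance: finance team gets read-only access.
curated_finance_acl = build_acl(
    acl_entry("user", "rwx"),
    acl_entry("group", "r-x"),
    acl_entry("group", "r-x", FINANCE_TEAM),
    acl_entry("other", "---"),
)

print(raw_sales_acl)
print(curated_finance_acl)
```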

Hadoop/Spark jobs expect a hierarchical filesystem (HDFS).
With HNS, ADLS Gen2 behaves like HDFS → making Spark/Synapse/Databricks integration seamless.
Listing, partition pruning, directory scans are faster.

Guarantees atomic directory and file operations.
Example: If you rename a folder of 1M files → operation is atomic at the namespace level, no risk of half-renamed state.
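In code, such a rename is a single call. The sketch below assumes the `azure-storage-file-datalake` Python package and its `DataLakeServiceClient`; the `rename_directory` wrapper, the account URL, and the path names are illustrative only. The client is passed in so the function itself has no import-time dependency on the SDK.

```python
def rename_directory(service_client, file_system: str,
                     old_path: str, new_path: str):
    """Rename a directory with one namespace-level operation.

    `service_client` is assumed to behave like
    azure.storage.filedatalake.DataLakeServiceClient.
    """
    directory = (
        service_client
        .get_file_system_client(file_system)
        .get_directory_client(old_path)
    )
    # The SDK expects the new name prefixed with the file system name.
    return directory.rename_directory(new_name=f"{file_system}/{new_path}")


# Usage sketch (needs azure-storage-file-datalake and azure-identity installed,
# plus an account with the hierarchical namespace enabled):
#
# from azure.identity import DefaultAzureCredential
# from azure.storage.filedatalake import DataLakeServiceClient
#
# service = DataLakeServiceClient(
#     account_url="https://storageaccount.dfs.core.windows.net",
#     credential=DefaultAzureCredential(),
# )
# rename_directory(service, "datalake", "raw/2025", "raw/archive-2025")
```

No per-file loop appears anywhere: the number of files under the directory does not change what the caller has to do.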

Enabled at account creation → Check “Hierarchical namespace” when creating a Storage Account for ADLS Gen2. An existing Blob Storage account can also be upgraded later, but once HNS is enabled it cannot be disabled.
Once enabled:
Storage Account = Root
Containers = File systems
Directories = Actual folders
Files = Data objects

Path example (with HNS):

abfss://datalake@storageaccount.dfs.core.windows.net/raw/2025/transactions/file1.parquet

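Here:
abfss → ABFS driver scheme; the trailing “s” means the connection uses TLS.
datalake → container (file system).
storageaccount → storage account name.
dfs.core.windows.net → Data Lake Storage (DFS) endpoint.
/raw/2025/transactions/ → directories.
file1.parquet → file (data object).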
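The parts of such a path can also be pulled apart programmatically. This is a minimal sketch using only the Python standard library; the `AbfsPath` type and `parse_abfss` function are hypothetical names, and the parser assumes the `container@account.dfs.core.windows.net` host form shown in the example.

```python
from typing import List, NamedTuple
from urllib.parse import urlparse


class AbfsPath(NamedTuple):
    container: str          # file system
    account: str            # storage account name
    directories: List[str]  # directory segments, outermost first
    filename: str           # last path segment ("" if the path is empty)


def parse_abfss(uri: str) -> AbfsPath:
    """Split an abfs:// or abfss:// URI into its components."""
    parsed = urlparse(uri)
    if parsed.scheme not in ("abfs", "abfss"):
        raise ValueError(f"not an ABFS URI: {uri!r}")
    container, sep, host = parsed.netloc.partition("@")
    if not sep:
        raise ValueError(f"missing 'container@account' part: {uri!r}")
    account = host.split(".")[0]
    parts = [p for p in parsed.path.split("/") if p]
    return AbfsPath(container, account, parts[:-1], parts[-1] if parts else "")


path = parse_abfss(
    "abfss://datalake@storageaccount.dfs.core.windows.net"
    "/raw/2025/transactions/file1.parquet"
)
print(path)
```

Note that the parser treats the last segment as a file name; a URI that points at a directory would need separate handling.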

Think of:

Blob Storage (Flat) → A big box of papers where you prefix filenames with labels (sales_2025_jan_data.csv).
ADLS Gen2 (HNS) → A real filing cabinet with folders, subfolders, and files inside.