Saltar a contenido

Azure Data Lake Gen2 Overview


🔹 What is ADLS Gen2?

  • Azure Data Lake Storage Gen2 is Microsoft’s enterprise-grade, big data storage service built on top of Azure Blob Storage.
  • It combines the scalability, durability, and low cost of Blob Storage with a hierarchical namespace (folders & files like a traditional file system).
  • It’s designed for analytics and big data workloads (Spark, Databricks, Synapse, HDInsight, etc.), while still being general-purpose storage.

🔹 Key Features

  1. Hierarchical Namespace (HNS)

  2. Unlike flat Blob Storage, ADLS Gen2 organizes data into directories and subdirectories.

  3. Enables atomic file operations like rename and move at the directory/file level.
  4. Reduces cost and complexity of working with files in analytics.

  5. Unified Storage

  6. Built on Blob Storage → same account, same data redundancy options, same durability.

  7. No need to maintain separate “data lake” and “blob” accounts.

  8. Optimized for Big Data Analytics

  9. Works natively with Apache Hadoop (HDFS) APIs.

  10. Seamless integration with Azure Databricks, Synapse Analytics, HDInsight, Azure Data Factory.

  11. Security

  12. Supports Azure RBAC (Role-Based Access Control) and POSIX-like ACLs (Access Control Lists).

  13. Fine-grained permissions down to folder/file level.
  14. Integrated with Azure Active Directory (AAD) for authentication.

  15. Cost-Effective

  16. Pay-as-you-go pricing (like Blob).

  17. Storage tiers (Hot, Cool, Archive) available.
  18. Hierarchical namespace reduces overhead for analytics jobs (cheaper file operations).

  19. Scalability & Performance

  20. Handles petabytes to exabytes of data.

  21. Optimized throughput for parallel analytics jobs.
  22. Works with serverless and distributed compute engines.

🔹 Use Cases

  • Data Lakes: Centralized storage for structured + semi-structured + unstructured data.
  • Analytics: Source for Spark, Synapse, Databricks, HDInsight.
  • Machine Learning: Storing training datasets and ML feature stores.
  • ETL Pipelines: Staging raw → curated → consumable zones.
  • Archival Storage: Retain large volumes of log/event data at low cost.

🔹 ADLS Gen2 vs Blob Storage

Feature Blob Storage ADLS Gen2
Namespace Flat Hierarchical
File operations Expensive (copy + delete) Atomic (rename/move)
Security Azure RBAC only RBAC + POSIX ACLs
Analytics integration Limited Optimized for big data
APIs Blob REST APIs Blob APIs + HDFS-compatible APIs

🔹 Architecture in a Data Lake

A typical ADLS Gen2 data lake is organized into layers:

  • Raw Zone → direct dump from source systems.
  • Staging/Curated Zone → cleaned, transformed datasets.
  • Presentation Zone → business-ready, aggregated data.

Great question 👍 The Hierarchical Namespace (HNS) is actually the defining feature of ADLS Gen2, so let’s go deeper.


🔹 What is a Hierarchical Namespace?

  • Normally, Blob Storage is a flat namespace:

  • Every object (blob) lives in a single flat container.

  • The “folders” you see in the Azure portal are just virtual prefixes in blob names (sales/2025/january/data.csv is just a string, not a real folder).
  • Operations like rename or move are simulated (copy + delete), which is slow and costly.

  • In ADLS Gen2, the Hierarchical Namespace (HNS) adds:

  • True directories and subdirectories (like an actual file system).

  • Objects are tracked as files within directories, not just as strings.
  • File system operations (rename, move, delete directory, list directory) become atomic and efficient.

🔹 Why Hierarchical Namespace Matters

1. Efficient File Operations

  • Rename/Move: In Blob storage → requires copy + delete (slow, doubles cost). In ADLS Gen2 → instant metadata update (atomic, cheap).
  • Delete Directory: In Blob storage → must delete each file one by one. In ADLS Gen2 → single operation at directory level.

2. Security & Access Control

  • Supports POSIX-like ACLs (Access Control Lists) at folder/file level. Example:

  • /raw/sales → only raw-data team has read/write.

  • /curated/finance → finance team has read-only.
  • Much finer granularity than just account/container level RBAC.

3. Performance for Analytics

  • Hadoop/Spark jobs expect a hierarchical filesystem (HDFS).
  • With HNS, ADLS Gen2 behaves like HDFS → making Spark/Synapse/Databricks integration seamless.
  • Listing, partition pruning, directory scans are faster.

4. Atomic Consistency

  • Guarantees atomic directory and file operations.
  • Example: If you rename a folder of 1M files → operation is atomic at the namespace level, no risk of half-renamed state.

🔹 Technical Details of HNS

  • Enabled at account creation → You must check “Hierarchical namespace” when creating a Storage Account for ADLS Gen2. (Cannot be disabled later.)
  • Once enabled:

  • Storage Account = Root

  • Containers = File systems
  • Directories = Actual folders
  • Files = Data objects

Path example (with HNS):

abfss://datalake@storageaccount.dfs.core.windows.net/raw/2025/transactions/file1.parquet

Here:

  • storageaccount → ADLS Gen2 account
  • datalake → File system (container)
  • raw/2025/transactions → Real directories
  • file1.parquet → File object

🔹 Analogy

Think of:

  • Blob Storage (Flat) → A big box of papers where you prefix filenames with labels (sales_2025_jan_data.csv).
  • ADLS Gen2 (HNS) → A real filing cabinet with folders, subfolders, and files inside.

🔹 Benefits Summary

  • ✅ Faster rename/move/delete (atomic ops)
  • ✅ Lower cost for file management
  • ✅ Fine-grained ACL-based security
  • ✅ Seamless HDFS compatibility (Spark, Hadoop)
  • ✅ Cleaner data lake organization (raw → curated → presentation)