Introduction
The Medallion Architecture is a data design pattern commonly used in data lake and lakehouse platforms such as Databricks.
It organizes data processing and storage into layers, each serving a distinct purpose, so that data quality improves as data moves from one layer to the next. The goal is to streamline data management and support efficient processing and analytics.
This helps organizations make data available in the form most useful for each business need.
Benefits of Using the Medallion Architecture
- Improved Data Quality: Data is refined step by step as it moves through the bronze (raw), silver (cleaned and enriched), and gold (aggregated and business-ready) layers, with quality checks applied between layers.
- Improved Collaboration: The Medallion Architecture’s clear separation of data layers allows for improved collaboration among data engineers, data scientists, and business analysts, making data processing and analysis more efficient.
- Enhanced Data Governance: The Medallion Architecture structures data into stages for better governance and compliance. It allows transparent data lineage tracking, which is crucial for regulatory compliance and audits.
Layers
Bronze (raw data)
The Bronze layer stores data from external source systems. Its table structures reproduce the source system tables, with additional metadata columns to capture information like the load date/time and process ID.
This layer focuses on capturing all the data in its original form (raw), ensuring that no information is lost during the ingestion process.
This layer also supports data lineage, auditability, and reprocessing when needed, without having to re-read the data from the source system.
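As a rough illustration of the Bronze idea, the sketch below copies source rows unchanged and appends load metadata. It uses plain Python dictionaries instead of real tables, and the names `to_bronze`, `_load_ts`, and `_process_id` are hypothetical, chosen only for this example.

```python
from datetime import datetime, timezone
import uuid


def to_bronze(source_rows, process_id=None, load_ts=None):
    """Copy source rows as-is and append load metadata columns.

    No cleaning or type casting happens here: invalid values are kept
    so the raw record can be audited or reprocessed later.
    """
    process_id = process_id or str(uuid.uuid4())
    load_ts = load_ts or datetime.now(timezone.utc).isoformat()
    return [
        {**row, "_load_ts": load_ts, "_process_id": process_id}
        for row in source_rows
    ]


# Hypothetical rows as they might arrive from a source system.
source_rows = [
    {"order_id": "1", "customer_id": "c1", "amount": "19.90"},
    {"order_id": "2", "customer_id": None, "amount": "n/a"},  # invalid, but kept
]

bronze = to_bronze(source_rows, process_id="run-001")
```

Note that the second row is stored even though it is clearly invalid. Rejecting it is left to a later layer, which keeps the Bronze copy faithful to the source.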
Silver (cleansed and conformed data)
Here, the data is enriched with additional information from other sources, for example through lookups and joins. The data must also meet quality standards before it is processed further.
This layer focuses on cleaning, filtering, and transforming data to make it more reliable and structured for further processing.
Preliminary analysis can also be done here, along with the creation of intermediate datasets for specific analytical purposes.
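A Silver step might look something like the following sketch, which casts types, applies simple quality checks, and enriches rows through a lookup. It is plain Python rather than a real table engine, and `to_silver`, `customer_lookup`, and the specific quality rules are assumptions made for illustration.

```python
from decimal import Decimal, InvalidOperation


def to_silver(bronze_rows, customer_lookup):
    """Return (clean_rows, rejected_rows).

    A row is rejected if its amount is not a valid number or its
    customer is unknown; otherwise it is typed and enriched.
    """
    clean, rejected = [], []
    for row in bronze_rows:
        try:
            amount = Decimal(row["amount"])
        except (InvalidOperation, TypeError):
            rejected.append(row)
            continue
        if row.get("customer_id") not in customer_lookup:
            rejected.append(row)
            continue
        clean.append({
            "order_id": int(row["order_id"]),
            "customer_id": row["customer_id"],
            "region": customer_lookup[row["customer_id"]],  # enrichment via lookup
            "amount": amount,
        })
    return clean, rejected


# Hypothetical Bronze rows and a hypothetical customer-to-region lookup.
bronze_rows = [
    {"order_id": "1", "customer_id": "c1", "amount": "19.90"},
    {"order_id": "2", "customer_id": None, "amount": "n/a"},
    {"order_id": "3", "customer_id": "c9", "amount": "5.00"},
]
customer_lookup = {"c1": "EU", "c2": "US"}

silver, rejected = to_silver(bronze_rows, customer_lookup)
```

Returning the rejected rows separately, instead of silently dropping them, is one possible design choice; it keeps failed records available for inspection.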
Gold (curated business-level tables)
The data in this final layer has already been cleaned, transformed, and enriched to meet the organization's quality standards.
Here, refined, business-ready data is collected, organized, and optimized for business intelligence (BI), analytics, and reporting, so that end users and BI tools can access and use it directly.
This layer typically uses denormalized data models, so queries need fewer joins.
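To make the Gold idea concrete, here is a minimal sketch that aggregates cleaned rows into one denormalized summary row per region, ready for reporting without further joins. The function `to_gold` and the column names are hypothetical, and plain Python stands in for whatever table engine is actually used.

```python
from collections import defaultdict
from decimal import Decimal


def to_gold(silver_rows):
    """Aggregate Silver rows into one summary row per region."""
    totals = defaultdict(lambda: {"order_count": 0, "total_amount": Decimal("0")})
    for row in silver_rows:
        agg = totals[row["region"]]
        agg["order_count"] += 1
        agg["total_amount"] += row["amount"]
    # Sort by region so the output order is stable for reports.
    return [{"region": region, **agg} for region, agg in sorted(totals.items())]


# Hypothetical Silver rows, already typed and enriched with a region.
silver_rows = [
    {"order_id": 1, "customer_id": "c1", "region": "EU", "amount": Decimal("19.90")},
    {"order_id": 4, "customer_id": "c2", "region": "US", "amount": Decimal("7.50")},
    {"order_id": 5, "customer_id": "c1", "region": "EU", "amount": Decimal("10.10")},
]

gold = to_gold(silver_rows)
```

Each output row carries both the grouping key and its metrics, so a BI tool could read it directly.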
Demo
Coming soon.