
Global Retail Inc.: Building the End-to-End Lakehouse

Case Study

About

This project presents the design and implementation of an end-to-end Lakehouse solution using Microsoft Fabric. The objective was to modernize analytics by moving data from a transactional Azure SQL database and flat files into a scalable Bronze, Silver, and Gold architecture. Through automated pipelines and a clean star schema model, the solution enables faster reporting, simplified data modeling, and reliable insights, giving business users a single, trusted analytics layer for sales insights.

Challenge

  • Transactional Data Not Analytics-Ready: Source data was highly normalized in Azure SQL, making reporting slow and complex.

  • Multiple Data Sources: Business data was split between an Azure SQL database and external flat files, requiring integration.

  • Slow and Manual Reporting: Analysts spent significant time joining tables instead of analyzing data.

  • Lack of a Single Source of Truth: Different reports used different logic, leading to inconsistent results.

  • No Scalable Data Model: The existing structure did not support fast aggregation or BI performance.

Solution

  • Medallion Architecture (Bronze, Silver, Gold): Implemented a layered Lakehouse design to separate raw data, cleaned data, and analytics-ready data.

  • Automated Data Ingestion: Used Microsoft Fabric pipelines to ingest data from Azure SQL and flat files into the Bronze layer.

  • Data Transformation with PySpark: Applied business logic, joins, and data cleansing in PySpark notebooks to create the Silver layer.

  • Analytics-Ready Gold Model: Built a star schema with fact and dimension tables to support fast and consistent reporting.

  • End-to-End Orchestration: Built a master pipeline that orchestrates ingestion pipelines, notebooks, validation, and refresh schedules for daily automated execution.

Business Impact

  • Faster Business Reporting: Daily sales and product insights are now available without complex query logic or delays.

  • Consistent and Trusted Data: A centralized Gold layer provides a single, governed source of truth for all reports.

  • Reduced Operational Effort: Automated pipelines and pre-modeled data eliminated repetitive data preparation work.

  • Analytics-Ready Foundation: The scalable architecture supports future reporting, data sources, and business growth.

Control Table

To manage a large one-time data load in a controlled and scalable way, a control table drives the ingestion logic. The solution performs a one-time load from a large Azure SQL database into the Fabric Bronze layer (Lakehouse). Because the data volume is high, the load is split into smaller batches instead of loading everything at once. The control table defines batch ranges and source queries, enabling dynamic ingestion without hardcoding logic in the pipeline: the pipeline simply loops through the batches, keeping the load process scalable and controlled.
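The ingestion pipeline itself is built in Fabric Data Factory, but the batch logic the control table drives can be sketched in PySpark. The sketch below is illustrative only: the control table name, its columns (batch_id, start_id, end_id), the source table, and the JDBC connection string are assumptions rather than the project's actual schema.

# Illustrative PySpark sketch of control-table-driven batch ingestion.
# Assumed control table columns: batch_id, start_id, end_id.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the control table that defines each batch's key range.
batches = spark.read.table("control_table").orderBy("batch_id").collect()

jdbc_url = "jdbc:sqlserver://<server>.database.windows.net;database=<db>"  # placeholder

for b in batches:
    # Bound each source query to one batch so no single read is too large.
    query = (
        f"SELECT * FROM dbo.Sales "
        f"WHERE SalesID BETWEEN {b['start_id']} AND {b['end_id']}"
    )
    batch_df = (
        spark.read.format("jdbc")
        .option("url", jdbc_url)
        .option("query", query)
        .load()
    )
    # Append the batch to the Bronze table.
    batch_df.write.format("delta").mode("append").saveAsTable("bronze_sales")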

Watermark Logic - Incremental Load

To handle the incoming daily data volume efficiently, incremental ingestion is implemented after the initial batch load. The pipeline first reads the last processed watermark value and then ingests only new or updated records into the Bronze layer. Once the load completes, the watermark is updated to ensure subsequent runs process only fresh data. This approach minimizes data movement, improves performance, and supports reliable daily data refreshes.
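The watermark pattern described above can be sketched in PySpark as follows. The watermark table, its single-row layout, and the ModifiedDate column are illustrative assumptions.

# Illustrative sketch of watermark-based incremental ingestion;
# watermark_table (single row assumed) and ModifiedDate are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# 1. Read the last processed watermark value.
last_wm = spark.read.table("watermark_table").agg(F.max("last_value")).first()[0]

# 2. Pull only rows changed since the watermark.
incr_query = f"SELECT * FROM dbo.Orders WHERE ModifiedDate > '{last_wm}'"
incr_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://<server>.database.windows.net;database=<db>")
    .option("query", incr_query)
    .load()
)

# 3. Append the new records to the Bronze layer.
incr_df.write.format("delta").mode("append").saveAsTable("bronze_orders")

# 4. Advance the watermark so the next run starts from here.
if incr_df.count() > 0:
    new_wm = incr_df.agg(F.max("ModifiedDate")).first()[0]
    spark.sql(f"UPDATE watermark_table SET last_value = '{new_wm}'")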

Silver Transformation

After data ingestion, raw data from the Bronze layer is processed using a PySpark notebook to create the Silver layer. In this step, customer data is joined with region mapping data to enrich it with business context. Transformations such as data cleansing, standardization, and derived fields like full customer name are applied. The Silver Customer table represents clean, consistent, and enriched data that serves as a foundation for downstream Gold modeling.

The transformed customer data is written as a Delta table named Silver_Customers in the Silver layer.
This Delta table stores clean and standardized customer data for downstream Gold modeling.
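A minimal PySpark sketch of this Bronze-to-Silver step, with illustrative table and column names standing in for the actual schema:

# Illustrative Bronze-to-Silver customer transformation; table and
# column names are assumptions, not the exact project schema.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

customers = spark.read.table("bronze_customers")
regions = spark.read.table("bronze_region_mapping")

silver_customers = (
    customers
    # Enrich customer records with region context.
    .join(regions, on="RegionID", how="left")
    # Cleansing and standardization.
    .dropDuplicates(["CustomerID"])
    .withColumn("Email", F.lower(F.trim("Email")))
    # Derived field: full customer name.
    .withColumn("FullName", F.concat_ws(" ", "FirstName", "LastName"))
)

# Persist the result as a Delta table in the Silver layer.
silver_customers.write.format("delta").mode("overwrite").saveAsTable("Silver_Customers")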

In this step, a surrogate key (CustomerKey) is generated for each customer using a monotonically increasing identifier. This key is used instead of the source system customer ID to uniquely identify records in the Gold layer.

Surrogate keys are required to support star schema design, enable efficient joins between fact and dimension tables, and ensure stable relationships even if source system IDs change over time.

Source system primary keys are not used because they can change, be reused, or differ across systems.
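A minimal sketch of this surrogate key step, using PySpark's monotonically_increasing_id as described above; the selected columns are assumptions:

# Illustrative surrogate key generation for the customer dimension.
# Note: monotonically_increasing_id yields unique, increasing IDs,
# but they are not consecutive.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

dim_customer = (
    spark.read.table("Silver_Customers")
    # Surrogate key independent of the source system's CustomerID.
    .withColumn("CustomerKey", F.monotonically_increasing_id())
)

dim_customer.write.format("delta").mode("overwrite").saveAsTable("Gold_Dim_Customer")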

Date Table Creation

A Date dimension is created by deriving calendar attributes such as year, quarter, month, day number, and day name from the transaction date. This table provides a consistent date reference for the Gold layer and enables time-based analysis like daily, monthly, and yearly reporting.
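A minimal sketch of deriving the Date dimension in PySpark; the Silver_Sales source table and OrderDate column are assumed names:

# Illustrative Date dimension derivation from transaction dates.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

dim_date = (
    spark.read.table("Silver_Sales")
    .select(F.col("OrderDate").alias("Date"))
    .distinct()
    .withColumn("Year", F.year("Date"))
    .withColumn("Quarter", F.quarter("Date"))
    .withColumn("Month", F.month("Date"))
    .withColumn("DayNumber", F.dayofmonth("Date"))
    .withColumn("DayName", F.date_format("Date", "EEEE"))  # e.g. "Monday"
)

dim_date.write.format("delta").mode("overwrite").saveAsTable("Gold_Dim_Date")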

Fact Table Creation

In this step, sales data is aggregated to create the Fact_Sales table in the Gold layer. The data is grouped by date, customer, product, and order status, and key measures such as total sales amount, total quantity, and order count are calculated. This fact table serves as the core analytics table for reporting and business analysis.
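A minimal sketch of the Fact_Sales aggregation; the key and measure column names are assumptions:

# Illustrative Fact_Sales aggregation for the Gold layer.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

fact_sales = (
    spark.read.table("Silver_Sales")
    # Grain: one row per date, customer, product, and order status.
    .groupBy("DateKey", "CustomerKey", "ProductKey", "OrderStatus")
    .agg(
        F.sum("SalesAmount").alias("TotalSalesAmount"),
        F.sum("Quantity").alias("TotalQuantity"),
        F.countDistinct("OrderID").alias("OrderCount"),
    )
)

fact_sales.write.format("delta").mode("overwrite").saveAsTable("Gold_Fact_Sales")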

Semantic Model 

The Gold layer is modeled as a star schema with a central Fact_Sales table connected to Customer, Product, and Date dimension tables. Relationships are defined using surrogate keys to enable efficient joins and consistent analysis. This semantic model provides a clean and reliable foundation for Power BI reporting and business insights.

Master Pipeline

This master pipeline orchestrates the complete data workflow, starting with incremental ingestion for orders and sales, followed by Bronze to Silver and Silver to Gold transformations. Controlled wait steps ensure dependencies are met before triggering the semantic model refresh. The successful run confirms that the entire Lakehouse pipeline executes automatically and delivers fresh data for reporting on a daily schedule.
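The orchestration itself lives in a Fabric Data Factory master pipeline, but its execution order can be sketched with Fabric's notebook utilities. The notebook names and the fixed wait below are assumptions; the real pipeline expresses the same sequence with activity dependencies, wait steps, and a semantic model refresh step.

# Hedged sketch of the orchestration order; notebook names are assumptions,
# and the actual solution is a Data Factory master pipeline, not a notebook.
import time
from notebookutils import mssparkutils

# 1. Incremental ingestion for orders and sales into Bronze.
mssparkutils.notebook.run("Ingest_Orders_Incremental", 1800)
mssparkutils.notebook.run("Ingest_Sales_Incremental", 1800)

# 2. Bronze-to-Silver, then Silver-to-Gold transformations.
mssparkutils.notebook.run("Bronze_To_Silver", 1800)
mssparkutils.notebook.run("Silver_To_Gold", 1800)

# 3. Controlled wait so Gold tables settle before the model refresh
#    (the real pipeline uses a Wait activity and dependency checks).
time.sleep(60)

# 4. The semantic model refresh is then triggered (in the actual pipeline,
#    via the semantic model refresh activity).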

Power BI Dashboard

  • Total Sales Amount by Region: Shows the total revenue distributed across regions (Midwest, South, Northeast, and West), helping compare regional performance.

  • Top 5 Products by Sales: Displays the top five products ranked by total sales amount, highlighting best-performing products.

  • Total Sales Amount Over Time: Shows daily sales trends over time to identify patterns, peaks, and seasonality.

  • Customer Filter: Allows users to filter all visuals by customer name for focused and detailed analysis.
