
The Workforce Intelligence Engine 

Case Study

About

This project focused on designing and implementing a scalable people analytics solution using a modern lakehouse architecture. The goal was to centralize fragmented workforce data and enable reliable, analytics-ready datasets to support HR and Finance reporting. The solution was built on Microsoft Fabric, leveraging pipelines and PySpark notebooks to ingest, transform, and prepare employee and department data for downstream analysis in a structured and repeatable manner.

Challenge

The existing data setup presented several limitations that restricted timely workforce insights and operational efficiency:

  • Manual and Fragmented Data Handling: Employee and department data was delivered as raw files, requiring manual effort to prepare and analyze, increasing the risk of errors and inconsistencies.

  • Lack of Scalable Analytics: There was no centralized processing layer to support aggregations, joins, and historical workforce analysis as data volumes increased.

  • Delayed Decision-Making: Key questions around headcount, salary distribution, tenure, and departmental costs required ad hoc analysis, slowing down HR and Finance reporting.

  • No Reusable Data Pipeline: The absence of a standardized ingestion and transformation process made it difficult to reuse logic or extend analytics for future requirements.

Solution

  • Designed a data pipeline in Microsoft Fabric to copy raw CSV files into a Lakehouse environment

  • Used PySpark notebooks to cleanse, transform, and enrich employee and department datasets (see the sketch after this list)

  • Implemented joins, aggregations, window functions, and date-based calculations to support workforce analytics use cases

  • Created structured, analytics-ready tables that can be reused for reporting and further analysis
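
To make the cleansing step concrete, a minimal PySpark sketch is shown below. It assumes the pipeline has copied a raw employee CSV into the Lakehouse Files area and that columns such as employee_id, department_id, salary, and hire_date exist; the path, schema, and table name are illustrative rather than the project's exact code.

    from pyspark.sql import SparkSession
    from pyspark.sql.types import (StructType, StructField, IntegerType,
                                   StringType, DoubleType, DateType)
    from pyspark.sql.functions import col, datediff, current_date

    spark = SparkSession.builder.getOrCreate()

    # Enforce the schema at read time instead of relying on inference,
    # so malformed rows surface early rather than in downstream reports.
    employee_schema = StructType([
        StructField("employee_id", IntegerType(), False),
        StructField("name", StringType(), True),
        StructField("department_id", IntegerType(), True),
        StructField("salary", DoubleType(), True),
        StructField("hire_date", DateType(), True),
    ])

    employees_df = (
        spark.read
        .option("header", True)
        .schema(employee_schema)
        .csv("Files/raw/employees.csv")   # hypothetical Lakehouse file path
    )

    # Basic cleansing plus a date-based calculation: drop rows missing keys
    # and derive tenure in years from hire_date.
    employees_clean = (
        employees_df
        .dropna(subset=["employee_id", "department_id"])
        .withColumn("tenure_years",
                    datediff(current_date(), col("hire_date")) / 365.25)
    )

    # Persist as an analytics-ready Lakehouse table for reuse in reporting.
    employees_clean.write.mode("overwrite").saveAsTable("employees_clean")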

Business Impact

The implemented lakehouse solution improved the reliability and efficiency of workforce analytics:

  • Improved Data Availability: Automated ingestion ensured consistent access to employee and department data without manual intervention.

  • Higher Data Consistency: PySpark-based schema enforcement and transformations reduced data quality issues in analytics outputs.

  • Faster Workforce Insights: Analytics-ready tables enabled quicker analysis of headcount, salary distribution, and employee tenure.

  • Scalable Foundation: The pipeline and notebook design supports future data growth and additional analytics use cases.

Joined the employee and department DataFrames using an inner join on department_id to combine related records. This creates a unified dataset that includes both employee details and their corresponding department information for further analysis.
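
A minimal sketch of this join, assuming DataFrames named employees_df and departments_df that share a department_id column (the names are illustrative):

    # Inner join keeps only employees whose department_id matches a department
    # record, attaching department attributes to each employee row.
    employee_dept_df = employees_df.join(
        departments_df,
        on="department_id",
        how="inner",
    )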

Grouped the data by department and calculated the average salary using aggregation functions, with explicit type casting and rounding for precision. This produces standardized, department-level salary metrics for accurate compensation analysis.
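
Roughly how that aggregation looks in PySpark, assuming the joined DataFrame from the previous step and an illustrative department_name column:

    from pyspark.sql.functions import avg, col, round as spark_round

    # Average salary per department, cast to double and rounded to two decimals
    # so downstream reports show consistent, comparable figures.
    avg_salary_df = (
        employee_dept_df
        .groupBy("department_id", "department_name")
        .agg(spark_round(avg(col("salary").cast("double")), 2).alias("avg_salary"))
    )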

Used Window.partitionBy("department_id").orderBy(col("salary").desc()) with the rank() window function to rank employees within each department by salary. This makes it easy to identify the top-paid employees and compare compensation levels within the same department.
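
A short sketch of the ranking step; the window specification matches the one quoted above, while the DataFrame and output column names are illustrative:

    from pyspark.sql import Window
    from pyspark.sql.functions import col, rank

    # Rank employees by salary within each department; rank 1 is the highest paid.
    salary_window = Window.partitionBy("department_id").orderBy(col("salary").desc())
    ranked_df = employee_dept_df.withColumn("salary_rank", rank().over(salary_window))

    # Example usage: keep the top three earners in each department.
    top_earners_df = ranked_df.filter(col("salary_rank") <= 3)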

Calculated a running total of employee salaries within each department based on hire date order. This helps analyze how total salary cost grows over time as new employees are added to each department.
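
A sketch of the running-total window, assuming the same joined DataFrame and a hire_date column; the frame runs from the department's first hire up to the current row:

    from pyspark.sql import Window
    from pyspark.sql.functions import col, sum as spark_sum

    # Cumulative salary cost per department in hire-date order: each row shows
    # the total committed salary once that employee had been hired.
    running_window = (
        Window.partitionBy("department_id")
        .orderBy(col("hire_date"))
        .rowsBetween(Window.unboundedPreceding, Window.currentRow)
    )

    running_total_df = employee_dept_df.withColumn(
        "running_salary_total", spark_sum("salary").over(running_window)
    )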
