
Global Freight Forwarders
Case Study
About
This project delivers an automated, incremental data ingestion solution for Global Freight Forwarders using Microsoft Fabric. The solution replaces a manual, inefficient workflow of processing raw JSON shipping logs with a robust pipeline that intelligently identifies and loads only new files into a Delta Lake table. The outcome enables the operations team to access consolidated, near real-time shipment data without manual intervention, ensuring data integrity and accelerating critical operational analysis.
Challenge
- Manual Processing Delays: Operations struggled with a manual workflow that prevented near real-time analysis of shipment patterns.
- Unstructured Log Growth: Daily additions of raw JSON shipping logs created a high-volume management challenge.
- Error-Prone File Identification: Manually identifying new files was slow, prone to human error, and delayed critical analysis.
- Inability to Scale: The manual process could not scale with data growth, hindering the efficiency of the operations department.
- Data Processing Bottleneck: These manual steps created a significant bottleneck, limiting visibility into logistics data.
Solution
- Incremental Data Loading: Implemented intelligent logic to identify and process only new or modified shipping logs each day, preventing data duplication and ensuring scalability (a sketch of this pattern follows the list).
- Automated Ingestion Logic: Developed a robust, automated pipeline in Microsoft Fabric to replace manual data workflows for Global Freight Forwarders.
- Delta Lake Integration: Architected the pipeline to consolidate raw JSON logs into a single, transactional Delta Lake table named Shipping Logs.
- Bronze Layer Optimization: Automated the movement of raw data from the "Files" section to the managed "Tables" section within the Lakehouse Bronze layer.
- Data Integrity & Reliability: Leveraged Delta Lake’s transactional storage layer to ensure high data reliability and schema consistency for daily operational runs.
- Near Real-Time Visibility: Streamlined the ingestion of fields such as ShipmentID, CarrierName, and Status to provide the operations team with immediate access to shipment patterns.
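The following is a minimal PySpark sketch of this incremental Bronze-layer load as it could run in a Fabric notebook. The folder path Files/shipping_logs, the table name shipping_logs_bronze, and the record timestamp column LogTimestamp are illustrative assumptions rather than the production names; spark is the SparkSession the Fabric notebook runtime provides.

    from pyspark.sql import functions as F

    # Illustrative names: the real Lakehouse folder, table, and columns may differ.
    RAW_PATH = "Files/shipping_logs/"        # raw JSON landing zone (Lakehouse "Files" section)
    BRONZE_TABLE = "shipping_logs_bronze"    # managed Delta table (Lakehouse "Tables" section)
    LAST_WATERMARK = "2024-01-01 00:00:00"   # in practice, read from the watermark table described below

    # Read the raw JSON shipping logs from the Files section.
    raw_df = spark.read.json(RAW_PATH)

    # Keep only records newer than the last processed timestamp (the "new-only" filter).
    new_df = raw_df.filter(F.col("LogTimestamp") > F.lit(LAST_WATERMARK))

    # Append the filtered records to the Bronze Delta table, creating it on the first run.
    new_df.write.format("delta").mode("append").saveAsTable(BRONZE_TABLE)

Because only the filtered rows are written, each daily run scales with the size of the new logs rather than with the full history.
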
Business impact
- Incremental Ingestion Logic: Engineered a "new-only" detection strategy that identifies and processes only the daily delta of JSON shipping logs, ensuring older data is never re-loaded and maintaining high performance as the total data volume grows.
- Eliminated Manual Bottlenecks: Automated 100% of the manual file identification process, removing human error from the logistics workflow.
- Near Real-Time Operational Awareness: Reduced the time between file arrival and analysis, allowing for faster tracking of in-transit and delayed shipments.
- Scalable Data Engineering: Built a production-ready ingestion pattern that maintains high performance as GFF’s daily shipping log volume grows.
- Unified Source of Truth: Provided the operations manager with a consolidated, queryable repository of all global logistics data in Delta Lake format (see the example query after this list).
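As an illustration of the kind of operational query this unified table makes possible, the snippet below counts delayed shipments per carrier. The table name shipping_logs_bronze and the Status value 'Delayed' are assumptions; only the ShipmentID, CarrierName, and Status fields come from the case itself.

    # Hypothetical example: surface delayed shipments by carrier from the consolidated table.
    delayed = spark.sql("""
        SELECT CarrierName, COUNT(ShipmentID) AS delayed_shipments
        FROM shipping_logs_bronze
        WHERE Status = 'Delayed'          -- assumed status value
        GROUP BY CarrierName
        ORDER BY delayed_shipments DESC
    """)
    delayed.show()
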

Watermark Table creation
To achieve true incremental ingestion, I implemented a control mechanism using a Watermark Table in Microsoft Fabric to act as the pipeline's "memory." By storing the Start Date (last processed timestamp) and End Date (current run time), the pipeline identifies the precise high-water mark needed to fetch only logs added within that specific window. This logic ensures that historical logistics data is never re-processed, significantly reducing compute costs and preventing data duplication in the Shipping Logs table.
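A minimal sketch of how such a control table could be created and seeded in a Fabric notebook is shown below. The table name watermark_control and its columns TableName and LastWatermark are illustrative assumptions; here LastWatermark plays the role of the Start Date, while the End Date is captured as the current run time when the pipeline executes.

    # One-time setup: create the watermark control table (names are illustrative).
    spark.sql("""
        CREATE TABLE IF NOT EXISTS watermark_control (
            TableName     STRING,     -- target table this watermark tracks
            LastWatermark TIMESTAMP   -- last successfully processed timestamp
        ) USING DELTA
    """)

    # Seed an initial low watermark so the first run ingests all existing logs.
    spark.sql("""
        INSERT INTO watermark_control
        VALUES ('shipping_logs_bronze', TIMESTAMP '1900-01-01 00:00:00')
    """)
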

Incremental load pipeline
The pipeline is composed of the following orchestrated activities:
- Lookup – Read Watermark: Retrieves the last processed timestamp from a control (watermark) table stored in the Lakehouse.
- Copy Data – Incremental Load: Uses the retrieved watermark to filter source data and ingest only new shipping log records into the Lakehouse.
- Notebook – Update Watermark: After a successful load, a Fabric notebook updates the watermark table with the current execution timestamp, preparing the pipeline for the next incremental run (a minimal sketch follows this list).
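A minimal sketch of the final step, assuming the watermark_control table from the earlier section, is shown below. In the Copy Data activity, the looked-up value is typically referenced through a pipeline expression such as @activity('Lookup – Read Watermark').output.firstRow.LastWatermark; the activity and column names used here are assumptions.

    # Runs as the last pipeline activity, only after the Copy Data step succeeds,
    # so a failed run is retried over the same window on the next execution.
    spark.sql("""
        UPDATE watermark_control
        SET LastWatermark = current_timestamp()   -- current execution time becomes the new high-water mark
        WHERE TableName = 'shipping_logs_bronze'
    """)

In practice, the pipeline could instead pass its own run timestamp into the notebook as a parameter, so the recorded End Date matches the exact window the Copy Data activity used.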

