
AeroSurf Nsw- Hybrid SCD Implementation
Case Study
About
AeroSurf NSW manages large volumes of drone telemetry data used to monitor shark activity and ocean conditions across Sydney’s beaches. The data pipeline is built using PySpark and Delta Lake to handle schema changes, enforce data quality, and maintain historical accuracy. A hybrid Slowly Changing Dimension (SCD) approach is used to track critical risk-level changes over time while updating non-critical attributes efficiently. Delta Change Data Feed (CDF) enables full auditability and recovery by capturing every row-level change, ensuring the system remains reliable, compliant, and scalable for operational decision-making.
Challenge
-
Schema Drift: Telemetry data from different drone models arrived with changing schemas, causing ingestion failures and data loss risks.
-
Historical Accuracy: Critical changes in shark risk levels needed full history tracking, while non-critical updates had to be handled without creating noise.
-
Data Corruption Risk: Manual updates introduced the possibility of large-scale data errors, requiring a reliable recovery and audit mechanism.
-
Audit & Compliance: Regulatory requirements demanded a clear chain of custody, including who changed data, when it changed, and how it was restored.
-
Operational Time Pressure: Data had to be processed and validated within a narrow early-morning window to support real-time safety decisions.
Solution
-
Schema Evolution with Delta Lake: Enabled schema merging during ingestion to support multiple drone models without breaking pipelines.
-
Hybrid SCD Implementation: Applied SCD Type 2 for critical risk-level changes to preserve history, while using SCD Type 1 for non-critical attribute updates to reduce noise.
-
Controlled Data Recovery: Used Delta Time Travel and RESTORE to quickly recover from accidental data corruption without data loss.
-
Audit with Change Data Feed (CDF): Captured row-level changes to provide a complete audit trail for forensic analysis and compliance reporting.
-
Operational Reliability: Optimized ingestion and processing to ensure data was ready within strict early-morning time windows
Business impact
-
Improved Public Safety: Delivered accurate, timely risk data to support early-morning beach safety decisions.
-
Reduced Operational Risk: Enabled fast recovery from data issues, minimizing downtime and decision delays.
-
Audit Readiness: Provided full data traceability to meet legal, regulatory, and compliance requirements.
-
Scalable Architecture: Supported growing data volumes and new drone models without redesigning the pipeline.
-
Increased Trust in Data: Ensured stakeholders could rely on historical and current data for critical decisions.

This implementation uses Databricks Auto Loader (CloudFiles) to ingest JSON telemetry data into the Bronze layer of a Lakehouse architecture. The stream is configured with schema evolution (mergeSchema) to handle new columns without breaking the pipeline, while maintaining a dedicated schema location and checkpoint directory for reliability and fault tolerance and the data is written directly into a Delta table to enable ACID transactions, time travel, and scalable downstream transformations.
Hybrid SCD Implementation


This implementation demonstrates a Hybrid Slowly Changing Dimension (SCD) logic using a Delta Lake MERGE statement. The solution combines Type 1 and Type 2 behavior within a single operation. When non-critical attributes such as battery or water temperature change, the existing record is updated in place (SCD Type 1). However, when the critical attribute risk_level changes, the logic preserves history by expiring the current record and allowing insertion of a new version (SCD Type 2). The merge condition ensures updates are applied only to active records (is_current = true), maintaining accurate historical tracking while preventing duplicate active rows. This approach balances auditability with performance and reflects real-world enterprise dimension management requirements.

The DESCRIBE HISTORY workspace.aerosurf.bronze_drone_log command provides a complete audit trail of all operations performed on the Delta table. It captures version numbers, timestamps, operation types (INSERT, UPDATE, MERGE, RESTORE), user information, and operation metrics. This enables full transparency into data changes, making it possible to identify incidents such as unintended updates, track when they occurred, and restore the table to a previous stable version using Delta Time Travel. This built-in versioning mechanism is critical for governance, compliance, and forensic analysis in production data environments.

This query leverages Delta Lake’s Change Data Feed (CDF) to generate a forensic audit trail of row-level changes within the table. By querying table_changes() between specific version ranges (X to Y), it captures inserts, updates, and deletes along with their corresponding commit version and timestamp. This enables precise tracking of when records were modified, what changed, and which transactions were responsible. CDF is especially valuable for compliance, impact analysis, and incident recovery, as it provides transparent visibility into data mutations without relying on external logging systems.

