
  SantéFlux: The GDPR Privacy Crisis

Case Study

About

SantéFlux, a European health-tech company, faced a GDPR compliance crisis while delivering large-scale health analytics from millions of smartwatch readings. In this case study, a Databricks PySpark notebook was used to mask and hash sensitive user identifiers, filter low-quality and frozen sensor data, and generate city-level health trend aggregations. The solution also optimized performance through early filtering and efficient join strategies, enabling compliant, scalable analytics that could support executive decision-making under strict regulatory and performance constraints.

Challenge

  • GDPR Compliance Risk: Raw user identifiers and names were present in shared datasets, violating privacy regulations and blocking analytics access


  • PII Masking Strategy: Decisions around hashing, salting, and consistency across teams impacted both security and data usability


  • Data Quality Issues: “Frozen” sensors produced repeated heart-rate values, skewing city-level health metrics


  • Performance Bottlenecks: Large aggregations triggered expensive Spark shuffles, causing long runtimes and instability


  • Join Complexity: VIP subscription analysis required joining a massive vitals table with a small reference dataset without overloading the cluster

Solution

  • Implemented a GDPR-compliant transformation approach using PySpark to mask and hash sensitive identifiers before any analytical processing.


  • Applied data quality filters to detect and isolate frozen sensor readings early, enabling predicate pushdown and reducing shuffle costs


  • Used efficient aggregation strategies (groupBy, pivot, avg) to generate a City Stress Matrix suitable for executive reporting


  • Optimized joins by leveraging broadcast joins for small reference datasets, minimizing shuffle overhead


  • Enabled interactive analysis through parameterization, allowing leadership to explore city-level trends without rerunning the full workload

Business Impact

  • Restored GDPR Compliance: Enabled analytics access by masking and hashing sensitive user data, eliminating regulatory and legal exposure


  • Unblocked Executive Reporting: Delivered accurate, city-level health trend insights required for leadership and board-level decision-making


  • Improved Data Trustworthiness: Removed frozen and low-quality sensor readings, ensuring health metrics reflected real user behavior


  • Faster Analytical Performance: Reduced processing time by filtering bad data early and optimizing joins, avoiding costly Spark shuffles


  • Scalable Analytics Foundation: Established a compliant and performant Databricks-based analytics approach that can scale with growing device data volumes

This code reads the subscription data from a CSV file using a predefined schema.
Applying the schema ensures the data is clean and consistent before it is used in joins and analysis.
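The notebook's actual PySpark read isn't reproduced on this page; as a rough illustration of the idea, here is a pure-Python sketch of schema enforcement on CSV rows (the column names `user_id` and `subscription_type` are assumptions):

```python
import csv
import io

# Assumed subscription schema: every value is cast by its declared type,
# mirroring spark.read.csv(path, schema=..., header=True).
SUBSCRIPTION_SCHEMA = {"user_id": str, "subscription_type": str}

def read_with_schema(csv_text, schema):
    """Parse CSV text and cast each column according to the schema."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{col: cast(row[col]) for col, cast in schema.items()} for row in reader]

sample = "user_id,subscription_type\nu1,VIP\nu2,Free\n"
subscriptions = read_with_schema(sample, SUBSCRIPTION_SCHEMA)
```

Declaring the schema up front, rather than inferring it, keeps every run consistent regardless of what the file happens to contain.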

This snippet reads the raw vitals data from a CSV file using an explicit schema.
Defining the schema ensures correct data types for timestamps, heart rate values, and user information before any transformations are applied.
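In PySpark this is a `StructType` passed to the reader; as a pure-Python sketch of the same type coercion (the column names are assumptions):

```python
import csv
import io
from datetime import datetime

def parse_vitals_row(row):
    """Cast one raw CSV row to the assumed vitals schema."""
    return {
        "user_id": row["user_id"],
        "name": row["name"],
        # ISO-8601 timestamps become datetime objects instead of raw strings.
        "ts": datetime.fromisoformat(row["ts"]),
        # Heart rate becomes an int so later min/max/avg work numerically.
        "heart_rate": int(row["heart_rate"]),
    }

sample = "user_id,name,ts,heart_rate\nu1,Alice,2024-05-01T10:03:00,72\n"
vitals = [parse_vitals_row(r) for r in csv.DictReader(io.StringIO(sample))]
```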

This snippet enforces GDPR compliance by masking user names and generating a secure hashed identifier using a shared salt.
The original user ID is removed, and the data is standardized for safe analytics without exposing personal information.
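A minimal pure-Python sketch of that masking logic (the field names and the salt value are placeholders; in PySpark the digest would come from `sha2`, and the real salt would live in a secret store):

```python
import hashlib

# Assumption: the shared salt would come from a secret store, not source code.
SHARED_SALT = "demo-salt"

def mask_record(rec, salt=SHARED_SALT):
    masked = dict(rec)
    # Partially mask the name: keep only the first character.
    name = masked.pop("name")
    masked["name_masked"] = name[0] + "***"
    # Replace the raw user ID with a salted SHA-256 digest and drop the original.
    user_id = masked.pop("user_id")
    masked["secure_id"] = hashlib.sha256((salt + user_id).encode()).hexdigest()
    return masked

row = mask_record({"user_id": "u1", "name": "Alice", "heart_rate": 72})
```

Because every team hashes with the same salt, the same user always maps to the same `secure_id`, which is what keeps later joins possible without ever exposing the raw identifier.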

This step displays the masked dataset, in which user names are partially masked and the original identifiers are replaced with secure hashes.

This step groups the masked heart-rate readings into 10-minute time windows.
It helps analyze how a user’s heart rate changes over time using only privacy-safe data.
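The windowing logic can be sketched in pure Python (in PySpark this is `groupBy("secure_id", window("ts", "10 minutes"))`; field names are assumptions):

```python
from collections import defaultdict
from datetime import datetime

def window_start(ts):
    # Floor a timestamp to the start of its 10-minute window.
    return ts.replace(minute=ts.minute - ts.minute % 10, second=0, microsecond=0)

def heart_rate_windows(readings):
    """Group heart-rate readings per user per 10-minute window."""
    buckets = defaultdict(list)
    for r in readings:
        buckets[(r["secure_id"], window_start(r["ts"]))].append(r["heart_rate"])
    return buckets

readings = [
    {"secure_id": "h1", "ts": datetime(2024, 5, 1, 10, 3), "heart_rate": 70},
    {"secure_id": "h1", "ts": datetime(2024, 5, 1, 10, 8), "heart_rate": 76},
    {"secure_id": "h1", "ts": datetime(2024, 5, 1, 10, 12), "heart_rate": 80},
]
buckets = heart_rate_windows(readings)
```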

This step identifies frozen sensors by checking cases where the heart rate did not change for a full 10-minute window.
If the minimum and maximum heart rate are the same within that window, the reading is flagged as frozen and excluded from analysis.
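The frozen-sensor rule reduces to a min/max comparison per window, sketched here in pure Python:

```python
def find_frozen(window_values):
    """Flag (sensor, window) pairs where the heart rate never changed.

    window_values maps (secure_id, window_start) -> list of heart rates;
    min == max over a full 10-minute window marks the sensor as frozen.
    """
    return {key for key, rates in window_values.items() if min(rates) == max(rates)}

frozen = find_frozen({
    ("h1", "10:00"): [70, 70, 70],   # stuck sensor: identical readings
    ("h2", "10:00"): [70, 75, 73],   # healthy variation
})
```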

This step removes frozen sensor readings by joining the main data with the frozen sensor list using a left anti join.
Only valid heart-rate readings are kept for further analysis.
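Left anti join semantics ("keep rows with no match on the other side") can be sketched like this; in the notebook it would be `df.join(frozen_df, on=..., how="left_anti")`:

```python
def left_anti_join(readings, frozen_keys, key_of):
    # Keep only rows whose key is absent from frozen_keys -- the same
    # semantics as PySpark's how="left_anti".
    return [r for r in readings if key_of(r) not in frozen_keys]

readings = [
    {"secure_id": "h1", "window": "10:00", "heart_rate": 70},
    {"secure_id": "h2", "window": "10:00", "heart_rate": 75},
]
frozen = {("h1", "10:00")}
valid = left_anti_join(readings, frozen, lambda r: (r["secure_id"], r["window"]))
```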

This step creates a secure hashed ID for subscription data using a shared salt.
The original user ID is removed and duplicates are dropped, ensuring the data is privacy-safe and ready for joining with the vitals dataset.
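Sketched in pure Python (same placeholder salt as the masking sketch above; in practice the salt is the shared secret that makes the hashed IDs line up across datasets):

```python
import hashlib

# Assumption: the same shared salt as the vitals pipeline, so secure IDs match.
SHARED_SALT = "demo-salt"

def secure_subscriptions(rows, salt=SHARED_SALT):
    seen, out = set(), []
    for row in rows:
        sid = hashlib.sha256((salt + row["user_id"]).encode()).hexdigest()
        if sid in seen:
            continue  # dropDuplicates() on the hashed ID
        seen.add(sid)
        # The raw user_id is dropped; only the secure ID survives.
        out.append({"secure_id": sid, "subscription_type": row["subscription_type"]})
    return out

subs = secure_subscriptions([
    {"user_id": "u1", "subscription_type": "VIP"},
    {"user_id": "u1", "subscription_type": "VIP"},  # duplicate row
])
```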

This step joins the cleaned vitals data with the subscription data using a broadcast join on the secure ID.
A broadcast join is used because the subscription table is small enough to fit in memory, which avoids expensive shuffles and improves join performance.
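The idea behind `broadcast()` can be sketched in pure Python: materialize the small table as an in-memory lookup so the large table never shuffles (field names are assumptions):

```python
def broadcast_join(vitals, subscriptions):
    # The small table becomes an in-memory dict -- conceptually what
    # PySpark's broadcast(subs_df) does: ship it to every executor so the
    # large vitals table joins locally, with no shuffle of the big side.
    lookup = {s["secure_id"]: s["subscription_type"] for s in subscriptions}
    return [
        {**v, "subscription_type": lookup[v["secure_id"]]}
        for v in vitals
        if v["secure_id"] in lookup  # inner-join semantics
    ]

joined = broadcast_join(
    [{"secure_id": "a", "heart_rate": 72}, {"secure_id": "b", "heart_rate": 80}],
    [{"secure_id": "a", "subscription_type": "VIP"}],
)
```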

This final step produces a city-level health summary by grouping and pivoting the data by subscription type.
Databricks widgets are then used to let users interactively switch between cities, enabling quick analysis without rerunning the entire notebook.
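The group-and-pivot step (in PySpark, `groupBy("city").pivot("subscription_type").avg("heart_rate")`) can be sketched as:

```python
from collections import defaultdict

def city_stress_matrix(rows):
    """Average heart rate per city, pivoted on subscription type."""
    totals = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for r in rows:
        cell = totals[r["city"]][r["subscription_type"]]
        cell[0] += r["heart_rate"]  # running sum
        cell[1] += 1                # running count
    return {
        city: {sub: total / count for sub, (total, count) in subs.items()}
        for city, subs in totals.items()
    }

matrix = city_stress_matrix([
    {"city": "Paris", "subscription_type": "VIP", "heart_rate": 70},
    {"city": "Paris", "subscription_type": "VIP", "heart_rate": 80},
    {"city": "Paris", "subscription_type": "Free", "heart_rate": 66},
])
```

In the notebook itself, a city dropdown (e.g. `dbutils.widgets.dropdown` read back with `dbutils.widgets.get`) would filter this matrix interactively without recomputing it.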
