
  SantéFlux: The GDPR Privacy Crisis

Case Study

About

SantéFlux, a European health-tech company, faced a GDPR compliance crisis while delivering large-scale health analytics from millions of smartwatch readings. In this case study, a Databricks PySpark notebook was used to mask and hash sensitive user identifiers, filter low-quality and frozen sensor data, and generate city-level health trend aggregations. The solution also optimized performance through early filtering and efficient join strategies, enabling compliant, scalable analytics that could support executive decision-making under strict regulatory and performance constraints.

Challenge

  • GDPR Compliance Risk: Raw user identifiers and names were present in shared datasets, violating privacy regulations and blocking analytics access


  • PII Masking Strategy: Decisions around hashing, salting, and consistency across teams impacted both security and data usability


  • Data Quality Issues: “Frozen” sensors produced repeated heart-rate values, skewing city-level health metrics


  • Performance Bottlenecks: Large aggregations triggered expensive Spark shuffles, causing long runtimes and instability


  • Join Complexity: VIP subscription analysis required joining a massive vitals table with a small reference dataset without overloading the cluster

Solution

  • Implemented a GDPR-compliant transformation approach using PySpark to mask and hash sensitive identifiers before any analytical processing.


  • Applied data quality filters to detect and isolate frozen sensor readings early, enabling predicate pushdown and reducing shuffle costs


  • Used efficient aggregation strategies (groupBy, pivot, avg) to generate a City Stress Matrix suitable for executive reporting


  • Optimized joins by leveraging broadcast joins for small reference datasets, minimizing shuffle overhead


  • Enabled interactive analysis through parameterization, allowing leadership to explore city-level trends without rerunning the full workload

Business Impact

  • Restored GDPR Compliance: Enabled analytics access by masking and hashing sensitive user data, eliminating regulatory and legal exposure


  • Unblocked Executive Reporting: Delivered accurate, city-level health trend insights required for leadership and board-level decision-making


  • Improved Data Trustworthiness: Removed frozen and low-quality sensor readings, ensuring health metrics reflected real user behavior


  • Faster Analytical Performance: Reduced processing time by filtering bad data early and optimizing joins, avoiding costly Spark shuffles


  • Scalable Analytics Foundation: Established a compliant and performant Databricks-based analytics approach that can scale with growing device data volumes

This code reads the subscription data from a CSV file using a predefined schema.
Applying the schema ensures the data is clean and consistent before it is used in joins and analysis.
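The notebook's actual PySpark read isn't reproduced on this page; as a rough illustration of the idea, here is a pure-Python sketch of schema enforcement on CSV rows (the column names `user_id` and `subscription_type` are assumptions):

```python
import csv
import io

# Assumed subscription schema: every value is cast by its declared type,
# mirroring spark.read.csv(path, schema=..., header=True).
SUBSCRIPTION_SCHEMA = {"user_id": str, "subscription_type": str}

def read_with_schema(csv_text, schema):
    """Parse CSV text and cast each column according to the schema."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{col: cast(row[col]) for col, cast in schema.items()} for row in reader]

sample = "user_id,subscription_type\nu1,VIP\nu2,Free\n"
subscriptions = read_with_schema(sample, SUBSCRIPTION_SCHEMA)
```

Declaring the schema up front, rather than inferring it, keeps every run consistent regardless of what the file happens to contain.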

This snippet reads the raw vitals data from a CSV file using an explicit schema.
Defining the schema ensures correct data types for timestamps, heart rate values, and user information before any transformations are applied.
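In PySpark this is a `StructType` passed to the reader; as a pure-Python sketch of the same type coercion (the column names are assumptions):

```python
import csv
import io
from datetime import datetime

def parse_vitals_row(row):
    """Cast one raw CSV row to the assumed vitals schema."""
    return {
        "user_id": row["user_id"],
        "name": row["name"],
        # ISO-8601 timestamps become datetime objects instead of raw strings.
        "ts": datetime.fromisoformat(row["ts"]),
        # Heart rate becomes an int so later min/max/avg work numerically.
        "heart_rate": int(row["heart_rate"]),
    }

sample = "user_id,name,ts,heart_rate\nu1,Alice,2024-05-01T10:03:00,72\n"
vitals = [parse_vitals_row(r) for r in csv.DictReader(io.StringIO(sample))]
```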

This snippet enforces GDPR compliance by masking user names and generating a secure hashed identifier using a shared salt.
The original user ID is removed, and the data is standardized for safe analytics without exposing personal information.
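A minimal pure-Python sketch of that masking logic (the field names and the salt value are placeholders; in PySpark the digest would come from `sha2`, and the real salt would live in a secret store):

```python
import hashlib

# Assumption: the shared salt would come from a secret store, not source code.
SHARED_SALT = "demo-salt"

def mask_record(rec, salt=SHARED_SALT):
    masked = dict(rec)
    # Partially mask the name: keep only the first character.
    name = masked.pop("name")
    masked["name_masked"] = name[0] + "***"
    # Replace the raw user ID with a salted SHA-256 digest and drop the original.
    user_id = masked.pop("user_id")
    masked["secure_id"] = hashlib.sha256((salt + user_id).encode()).hexdigest()
    return masked

row = mask_record({"user_id": "u1", "name": "Alice", "heart_rate": 72})
```

Because every team hashes with the same salt, the same user always maps to the same `secure_id`, which is what keeps later joins possible without ever exposing the raw identifier.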

This step displays the masked dataset, in which user names are partially masked and the original identifiers are replaced with secure hashes.

This step groups the masked heart-rate readings into 10-minute time windows.
It helps analyze how a user’s heart rate changes over time using only privacy-safe data.
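The windowing logic can be sketched in pure Python (in PySpark this is `groupBy("secure_id", window("ts", "10 minutes"))`; field names are assumptions):

```python
from collections import defaultdict
from datetime import datetime

def window_start(ts):
    # Floor a timestamp to the start of its 10-minute window.
    return ts.replace(minute=ts.minute - ts.minute % 10, second=0, microsecond=0)

def heart_rate_windows(readings):
    """Group heart-rate readings per user per 10-minute window."""
    buckets = defaultdict(list)
    for r in readings:
        buckets[(r["secure_id"], window_start(r["ts"]))].append(r["heart_rate"])
    return buckets

readings = [
    {"secure_id": "h1", "ts": datetime(2024, 5, 1, 10, 3), "heart_rate": 70},
    {"secure_id": "h1", "ts": datetime(2024, 5, 1, 10, 8), "heart_rate": 76},
    {"secure_id": "h1", "ts": datetime(2024, 5, 1, 10, 12), "heart_rate": 80},
]
buckets = heart_rate_windows(readings)
```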

This step identifies frozen sensors by checking cases where the heart rate did not change for a full 10-minute window.
If the minimum and maximum heart rate are the same within that window, the reading is flagged as frozen and excluded from analysis.
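The frozen-sensor rule reduces to a min/max comparison per window, sketched here in pure Python:

```python
def find_frozen(window_values):
    """Flag (sensor, window) pairs where the heart rate never changed.

    window_values maps (secure_id, window_start) -> list of heart rates;
    min == max over a full 10-minute window marks the sensor as frozen.
    """
    return {key for key, rates in window_values.items() if min(rates) == max(rates)}

frozen = find_frozen({
    ("h1", "10:00"): [70, 70, 70],   # stuck sensor: identical readings
    ("h2", "10:00"): [70, 75, 73],   # healthy variation
})
```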

This step removes frozen sensor readings by joining the main data with the frozen sensor list using a left anti join.
Only valid heart-rate readings are kept for further analysis.
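Left anti join semantics ("keep rows with no match on the other side") can be sketched like this; in the notebook it would be `df.join(frozen_df, on=..., how="left_anti")`:

```python
def left_anti_join(readings, frozen_keys, key_of):
    # Keep only rows whose key is absent from frozen_keys -- the same
    # semantics as PySpark's how="left_anti".
    return [r for r in readings if key_of(r) not in frozen_keys]

readings = [
    {"secure_id": "h1", "window": "10:00", "heart_rate": 70},
    {"secure_id": "h2", "window": "10:00", "heart_rate": 75},
]
frozen = {("h1", "10:00")}
valid = left_anti_join(readings, frozen, lambda r: (r["secure_id"], r["window"]))
```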

This step creates a secure hashed ID for subscription data using a shared salt.
The original user ID is removed and duplicates are dropped, ensuring the data is privacy-safe and ready for joining with the vitals dataset.
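Sketched in pure Python (same placeholder salt as the masking sketch above; in practice the salt is the shared secret that makes the hashed IDs line up across datasets):

```python
import hashlib

# Assumption: the same shared salt as the vitals pipeline, so secure IDs match.
SHARED_SALT = "demo-salt"

def secure_subscriptions(rows, salt=SHARED_SALT):
    seen, out = set(), []
    for row in rows:
        sid = hashlib.sha256((salt + row["user_id"]).encode()).hexdigest()
        if sid in seen:
            continue  # dropDuplicates() on the hashed ID
        seen.add(sid)
        # The raw user_id is dropped; only the secure ID survives.
        out.append({"secure_id": sid, "subscription_type": row["subscription_type"]})
    return out

subs = secure_subscriptions([
    {"user_id": "u1", "subscription_type": "VIP"},
    {"user_id": "u1", "subscription_type": "VIP"},  # duplicate row
])
```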

This step joins the cleaned vitals data with the subscription data using a broadcast join on the secure ID.
A broadcast join is used because the subscription table is small enough to fit in memory, which avoids expensive shuffles and improves join performance.
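The idea behind `broadcast()` can be sketched in pure Python: materialize the small table as an in-memory lookup so the large table never shuffles (field names are assumptions):

```python
def broadcast_join(vitals, subscriptions):
    # The small table becomes an in-memory dict -- conceptually what
    # PySpark's broadcast(subs_df) does: ship it to every executor so the
    # large vitals table joins locally, with no shuffle of the big side.
    lookup = {s["secure_id"]: s["subscription_type"] for s in subscriptions}
    return [
        {**v, "subscription_type": lookup[v["secure_id"]]}
        for v in vitals
        if v["secure_id"] in lookup  # inner-join semantics
    ]

joined = broadcast_join(
    [{"secure_id": "a", "heart_rate": 72}, {"secure_id": "b", "heart_rate": 80}],
    [{"secure_id": "a", "subscription_type": "VIP"}],
)
```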

This final step produces a city-level health summary by grouping and pivoting the data by subscription type.
Databricks widgets are then used to let users interactively switch between cities, enabling quick analysis without rerunning the entire notebook.
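The group-and-pivot step (in PySpark, `groupBy("city").pivot("subscription_type").avg("heart_rate")`) can be sketched as:

```python
from collections import defaultdict

def city_stress_matrix(rows):
    """Average heart rate per city, pivoted on subscription type."""
    totals = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for r in rows:
        cell = totals[r["city"]][r["subscription_type"]]
        cell[0] += r["heart_rate"]  # running sum
        cell[1] += 1                # running count
    return {
        city: {sub: total / count for sub, (total, count) in subs.items()}
        for city, subs in totals.items()
    }

matrix = city_stress_matrix([
    {"city": "Paris", "subscription_type": "VIP", "heart_rate": 70},
    {"city": "Paris", "subscription_type": "VIP", "heart_rate": 80},
    {"city": "Paris", "subscription_type": "Free", "heart_rate": 66},
])
```

In the notebook itself, a city dropdown (e.g. `dbutils.widgets.dropdown` read back with `dbutils.widgets.get`) would filter this matrix interactively without recomputing it.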
