Data Architecture for CTV/OTT Ad Delivery (with Enhanced Privacy and Scalability)
Introduction:
This post explores a data architecture for a CTV/OTT ad delivery system, emphasizing scalability, revenue tracking, user privacy, and responsible data handling. It uses an event-based data flow to support accurate ad billing and prevent double billing. We'll draw inspiration from structures used by leading on-demand streaming platforms while adhering to data privacy best practices.
Data Flow and Components:
The system utilizes a two-way data flow:
Ad Decision and Delivery: This flow determines which ads are shown to specific users and delivers them.
Ad Event Feedback and Analytics: This flow captures user interaction with ads and provides insights for advertisers and the streaming platform.
Key Components:
Machine Learning Model (External): This pre-existing model predicts ad click-through rates and audience segments for targeted advertising. (How the model works is outside the scope of this post.)
Key-Value Store (Redis): Stores a list of available ads for a particular event, ensuring fast retrieval for ad selection.
Content Delivery Network (CDN): Manages ad delivery across different time zones and handles ad appropriateness checks.
API Gateway: Provides a secure and privacy-conscious interface for ad selection and interaction logging.
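As a rough illustration of the ad selection path described above, here is a minimal Python sketch. A plain dictionary stands in for Redis, and the AdCandidate fields, the "ads:<event_id>" key format, and the select_ad function are all assumptions made for this example rather than details of any real deployment.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AdCandidate:
    ad_id: str
    predicted_ctr: float        # assumed to come from the external ML model
    remaining_impressions: int  # hypothetical inventory counter


# Stand-in for Redis: key "ads:<event_id>" maps to the ads available for that event.
AD_STORE = {
    "ads:final-2024": [
        AdCandidate("ad-soda", predicted_ctr=0.031, remaining_impressions=500),
        AdCandidate("ad-car", predicted_ctr=0.048, remaining_impressions=0),
        AdCandidate("ad-shoes", predicted_ctr=0.027, remaining_impressions=120),
    ],
}


def select_ad(event_id: str) -> Optional[AdCandidate]:
    """Return the deliverable candidate with the highest predicted CTR."""
    candidates = AD_STORE.get(f"ads:{event_id}", [])
    deliverable = [ad for ad in candidates if ad.remaining_impressions > 0]
    if not deliverable:
        return None  # caller decides on a fallback when no ad is available
    return max(deliverable, key=lambda ad: ad.predicted_ctr)
```

Here "ad-car" has the best predicted rate but no inventory left, so select_ad("final-2024") picks "ad-soda" instead. In a real system the lookup would be a call to the Redis client rather than a dictionary access.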
Enhanced User Privacy:
Dual User IDs: The system employs two user IDs:
User ID: Used for ad serving and personalization within the platform.
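Anonymized ID: Used for billing, anonymized analytics, and feedback loops. This ID is retained even after user deletion, so campaign analysis can continue. For that to hold up from a privacy standpoint, the mapping from Anonymized ID back to User ID must be deleted along with the user.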
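One possible way to derive the second ID is a keyed hash of the first. The sketch below uses HMAC-SHA256 from the Python standard library. The key value and the function name are placeholders, and the post itself does not prescribe this technique, so treat it as one option among several.

```python
import hashlib
import hmac

# Hypothetical secret. In practice it would live in a secrets manager and
# never be stored alongside the analytics data.
PSEUDONYM_KEY = b"replace-with-a-secret-key"


def anonymized_id(user_id: str, key: bytes = PSEUDONYM_KEY) -> str:
    """Derive a stable ID that cannot be reversed without the key."""
    digest = hmac.new(key, user_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()
```

The same User ID always maps to the same Anonymized ID, so billing and analytics stay consistent across events. Strictly speaking, a keyed hash is pseudonymization rather than full anonymization: anyone holding the key can recompute the mapping, so the key needs to be protected as carefully as the User IDs themselves.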
Data Management and Analytics:
Document Database: Stores detailed information about ad impressions, including user interaction (play, pause, skip), timestamp, replay status, errors, and latency, all linked to the anonymized ID.
Apache Kafka: A real-time streaming platform that ingests ad event data from the API Gateway, linked to anonymized IDs.
Apache Spark: Performs real-time analytics on the ad event stream it reads from Kafka and feeds the insights dashboards. It works only with anonymized data, for both advertiser and platform insights.
Data Warehouse: Stores both raw and aggregated ad event data. User IDs are never stored in the data warehouse.
Real-time Data: Stores raw anonymized events for fast analytics on current ad performance.
Aggregated Data: Stores anonymized and aggregated data for historical analysis and reporting.
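To make the event data above more concrete, here is a small Python sketch of an ad event record and the kind of per-minute count a streaming job might compute. The field names follow the attributes listed for the document database. The AdEvent class, its to_message method, and plays_per_minute are illustrative assumptions, and the code uses only the standard library rather than the Kafka or Spark APIs.

```python
import json
from collections import Counter
from dataclasses import asdict, dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class AdEvent:
    anonymized_id: str            # never the platform User ID
    ad_id: str
    interaction: str              # "play", "pause", or "skip"
    timestamp: float              # seconds since the Unix epoch
    is_replay: bool = False
    error: Optional[str] = None
    latency_ms: int = 0

    def to_message(self) -> bytes:
        """Serialize to a JSON payload that a stream producer could send."""
        return json.dumps(asdict(self), sort_keys=True).encode("utf-8")


def plays_per_minute(events: Iterable[AdEvent]) -> Counter:
    """Count first-time, error-free plays per (ad_id, minute) bucket."""
    counts: Counter = Counter()
    for event in events:
        if event.interaction == "play" and not event.is_replay and event.error is None:
            minute = int(event.timestamp // 60)
            counts[(event.ad_id, minute)] += 1
    return counts
```

Because the record carries only the Anonymized ID, anything downstream of this point (the stream, the dashboards, the warehouse) never sees a User ID.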
Addressing Challenges:
Double Billing Prevention: The document database records an ad as "shown" only once per impression, even if the user replays it, so a replay is never billed a second time. Separately, we maintain double the anticipated ad inventory to minimize the chance that no ad is available.
Privacy-Conscious Design: The API Gateway enforces data access controls and anonymizes user data before exposing it to analytics. User IDs are never stored in the data warehouse.
Limited Focus on User Churn: While the system models subscription changes, its primary focus is on delivering relevant ads, not influencing user decisions.
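The double billing prevention described above comes down to an idempotent write. The Python sketch below shows the idea with an in-memory set standing in for the document database. The ImpressionLedger class and the ad_break_id field are hypothetical: the post does not say which fields make up the uniqueness key, so the combination used here is an assumption.

```python
class ImpressionLedger:
    """In-memory stand-in for a uniqueness check in the document database."""

    def __init__(self) -> None:
        self._billed = set()

    def record(self, anonymized_id: str, ad_id: str, ad_break_id: str) -> bool:
        """Return True for the first (billable) showing, False for repeats."""
        key = (anonymized_id, ad_id, ad_break_id)
        if key in self._billed:
            return False  # replay or duplicate delivery: do not bill again
        self._billed.add(key)
        return True
```

A billing job would only charge for calls that return True. In a real document database the same effect would typically come from a unique index on the key fields, so that a duplicate insert is rejected by the database itself instead of by application code.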
Data Exposure and Analytics:
Real-time Analytics: Advertisers and streaming platforms receive real-time estimates of ad impressions and potential revenue through Apache Spark dashboards, utilizing anonymized data.
Aggregated User Segmentation Insights: The data warehouse allows for post-event analysis by joining anonymized data with pre-existing user segmentation information. This can be done as a batch job to avoid real-time delays and ensure sufficient data is available for meaningful insights.
Micro-Batching for Specific Needs: For specific business needs requiring user segmentation during the event week, micro-batching jobs can be implemented in the data warehouse. However, these will have a slight delay compared to real-time anonymized data to ensure privacy and responsible data handling. Discussions with business partners can determine the appropriate balance between timeliness and user privacy for such insights.
Platform Developer Insights: Aggregated data from the data warehouse, devoid of user IDs, lets platform developers monitor viewership trends and adjust server capacity, for example by scaling micro-services deployed behind load balancers.
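The batch join with user segmentation mentioned above could look roughly like the following Python sketch. It assumes the segmentation table is keyed by Anonymized ID, which the post does not state, and it adds a hypothetical min_group_size threshold so that very small groups are left out of reports until enough data has accumulated.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping, Tuple


def impressions_by_segment(
    impressions: Iterable[Tuple[str, str]],  # (anonymized_id, ad_id) pairs
    segments: Mapping[str, str],             # anonymized_id -> segment name
    min_group_size: int = 2,
) -> Dict[Tuple[str, str], int]:
    """Join impressions to segments and count per (ad_id, segment)."""
    counts: Counter = Counter()
    for anon_id, ad_id in impressions:
        segment = segments.get(anon_id)
        if segment is not None:  # inner join: skip IDs with no known segment
            counts[(ad_id, segment)] += 1
    # Drop groups too small to give meaningful (or suitably coarse) insights.
    return {key: n for key, n in counts.items() if n >= min_group_size}
```

Running this as a scheduled batch or micro-batch job, rather than per event, is what introduces the slight delay relative to the real-time dashboards.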
Security and Privacy Considerations:
Use industry-standard encryption for data in transit and at rest.
Implement user consent mechanisms for data collection and adhere to relevant privacy regulations.
Benefits:
Scalable architecture to handle peak viewership events.
Accurate ad billing with double-billing prevention.
Real-time insights for advertisers and the streaming platform, focused on anonymized data.
User privacy focus with dual user IDs, anonymized data storage, and responsible data handling practices.