Database Fundamentals

TikTok System Design & Database Design

Designing a robust and scalable database system for TikTok involves understanding its core functionalities, user interactions, and the immense volume of data it handles daily. This case study explores the essential components and architectural decisions required to build an efficient TikTok database system, focusing on unique database engineering concepts to enhance student learning.

What is TikTok?

TikTok is a leading short-form video-sharing platform that allows users to create, share, and discover a vast array of user-generated content. It emphasizes personalized content discovery through sophisticated recommendation algorithms that analyze user interactions to curate tailored video feeds. TikTok supports features such as video uploads, likes, comments, shares, live streaming, and real-time notifications, catering to millions of active users globally.

Requirements and Goals of the System

To design TikTok's database system, we focus on fulfilling the following key requirements:

Functional Requirements

  1. User Management:

    • Sign Up/Login: Users can create accounts, authenticate, and manage profiles.
    • Follow System: Users can follow or unfollow other users to curate their content feed.
  2. Content Management:

    • Video Uploads: Users can upload, edit, and delete videos.
    • Video Playback: Efficiently serve videos to users with minimal latency.
    • Interactions: Users can like, comment on, and share videos.
  3. Feed Generation:

    • Personalized Feed: Generate a real-time, personalized video feed based on user preferences and interactions.
    • Trending Content: Highlight trending videos and hashtags.
  4. Real-time Notifications:

    • Engagement Alerts: Notify users of new followers, likes, comments, and shares.

Non-functional Requirements

  1. Scalability: Handle billions of video views and millions of concurrent users daily.
  2. Low Latency: Ensure quick video playback and real-time feed updates.
  3. High Availability: Maintain uptime even during peak traffic periods.
  4. Data Consistency: Ensure accurate recording of user interactions and content delivery.
  5. Security and Privacy: Protect user data and ensure secure content handling.

Storage Capacity Estimation

Estimating TikTok's storage needs involves calculating the volume of video content, user data, and interactions generated daily.

Assumptions:

  • Daily Active Users (DAU): 1 billion.
  • Videos Uploaded per User per Day: 2.
  • Average Video Size: 50 MB.
  • Daily Video Views: 10 billion.
  • Average Video Metadata Size: 2 KB per video.
  • Interactions per Video: 100 likes, 20 comments, 10 shares on average.

Calculations:

  • Daily Video Storage:

    • 1B users * 2 videos * 50 MB = 100,000 TB/day
  • Daily Video Metadata Storage:

    • 2B videos * 2 KB = 4 TB/day
  • Daily Interaction Storage:

    • (10B views * negligible size) + (10B shares * negligible size) + (10B likes * negligible size) + (2B comments * 1 KB) = ~2 TB/day
  • Total Daily Storage Requirement: Approximately 100,006 TB/day (about 100 PB/day), almost all of it video files
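The arithmetic above can be checked with a short script. This is a minimal sketch that reuses the section's own figures and assumes decimal units (1 TB = 1,000,000 MB); the variable names are illustrative only.

```python
# Back-of-the-envelope storage estimate using the assumptions listed above.
# Assumption: decimal units, so 1 TB = 1,000,000 MB = 1,000,000,000 KB.
MB_PER_TB = 1_000_000
KB_PER_TB = 1_000_000_000

daily_active_users = 1_000_000_000
videos_per_user_per_day = 2
avg_video_size_mb = 50
metadata_kb_per_video = 2
comments_per_day = 2_000_000_000  # figure used in the interaction line item
comment_size_kb = 1               # size of one stored comment, as in that line item

videos_per_day = daily_active_users * videos_per_user_per_day

video_tb = videos_per_day * avg_video_size_mb / MB_PER_TB
metadata_tb = videos_per_day * metadata_kb_per_video / KB_PER_TB
interaction_tb = comments_per_day * comment_size_kb / KB_PER_TB
total_tb = video_tb + metadata_tb + interaction_tb

print(f"video files:  {video_tb:,.0f} TB/day")
print(f"metadata:     {metadata_tb:,.0f} TB/day")
print(f"interactions: {interaction_tb:,.0f} TB/day")
print(f"total:        {total_tb:,.0f} TB/day")
```

Video files dominate the total by several orders of magnitude, which is why the design treats object storage for media separately from the databases that hold metadata and interactions.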

Storage Fulfillment:

To meet this storage requirement, the design combines object storage for the video files themselves with NoSQL databases for high-volume metadata and interaction records. Data lifecycle management deletes expired content on schedule to keep storage usage in check.

High-Level System Design

To efficiently manage TikTok's extensive requirements, we'll adopt a Microservices Architecture comprising four primary microservices:

  1. User Service
  2. Content Service
  3. Interaction Service
  4. Analytics Service

This modular approach ensures scalability, maintainability, and efficient handling of distinct functionalities while fulfilling all system requirements.

Figure: TikTok High-level System Design

Key Components

  1. Clients

    • Mobile Apps: Native applications for iOS and Android.
    • Web Interface: Limited web functionalities for certain features.
  2. Load Balancers

    • Purpose: Distribute incoming traffic evenly across multiple instances of each microservice to prevent bottlenecks.
    • Examples: NGINX, HAProxy, AWS Elastic Load Balancer.
  3. API Gateway

    • Purpose: Acts as a single entry point for all client requests, handling routing and authentication.
    • Examples: Kong, AWS API Gateway, Zuul.
  4. Microservices: Separate microservices handle distinct responsibilities. The Microservices Architecture section below describes the microservices used in this TikTok design.

  5. Database Cluster

    • Relational Database: Stores structured data like user profiles and follow relationships.
    • NoSQL Databases: Manage high-volume, loosely structured data such as video metadata and interactions.
    • Object Storage: Stores media files efficiently with scalability.
  6. Message Queues

    • Purpose: Handle asynchronous tasks such as video processing, feed updates, and sending notifications.
    • Examples: Apache Kafka, RabbitMQ.
  7. File Storage

    • Purpose: Store and serve media content (videos) efficiently.
    • Examples: Amazon S3, Google Cloud Storage.

Microservices Architecture

Adopting a microservices architecture allows TikTok to scale each service independently and maintain a clear separation of concerns. Below is an overview of the four main microservices and how they fulfill system requirements.

1. User Service

  • Functionality: Handles user registration, login, profile management, and managing follow relationships.
  • Interactions:
    • API Gateway: Receives authentication and user management requests.
    • Relational Database: Stores user information and follow relationships.
  • Requirement Fulfillment:
    • User Management: Efficiently handles sign-up, login, and profile updates through a relational database ensuring data consistency.
    • Follow System: Manages user connections, leveraging relational schemas for integrity and complex queries.

2. Content Service

  • Functionality: Manages the lifecycle of videos, including uploading, processing (compression, transcoding), storage, and retrieval.
  • Interactions:
    • API Gateway: Receives video upload and retrieval requests.
    • Object Storage: Stores raw and processed video files.
    • Message Queues: Sends tasks for video processing and notifications.
    • Analytics Service: Provides data on video performance.
  • Requirement Fulfillment:
    • Video Uploads & Storage: Utilizes NoSQL databases and object storage to handle high-volume, unstructured video data efficiently.
    • Low Latency Playback: Ensures quick retrieval and streaming of videos through optimized storage solutions.
    • Video Processing: Automates compression and transcoding tasks to support various video formats and resolutions.
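The hand-off from upload to background processing can be sketched with Python's standard library. This is an illustration only: `queue.Queue` stands in for a broker such as Kafka or RabbitMQ, and `handle_upload`, `transcode_worker`, and the resolution list are hypothetical names, not part of any real service.

```python
import queue
import threading

# In-process stand-in for a message broker such as Kafka or RabbitMQ.
processing_queue = queue.Queue()
processed = []  # records (video_id, resolution) pairs the worker produced

def handle_upload(video_id, raw_url):
    """Upload path: enqueue a processing task and return immediately."""
    processing_queue.put({"video_id": video_id, "raw_url": raw_url})
    return "accepted"

def transcode_worker():
    """Background consumer: pulls tasks and pretends to transcode them."""
    while True:
        task = processing_queue.get()
        if task is None:  # sentinel value: shut the worker down
            processing_queue.task_done()
            break
        for resolution in ("480p", "720p"):  # assumed output renditions
            processed.append((task["video_id"], resolution))
        processing_queue.task_done()

worker = threading.Thread(target=transcode_worker, daemon=True)
worker.start()

status = handle_upload("v1", "raw/v1.mp4")
handle_upload("v2", "raw/v2.mp4")
processing_queue.put(None)
processing_queue.join()  # block until every queued task has been handled
worker.join()
print(status, processed)
```

The upload call returns before any transcoding happens, so slow processing never blocks the client request; the queue absorbs bursts and workers can be scaled independently.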

3. Interaction Service

  • Functionality: Facilitates user interactions such as likes, comments, shares, and tracking views.
  • Interactions:
    • API Gateway: Receives interaction requests from clients.
    • NoSQL Database: Stores interaction data for high throughput and low latency.
  • Requirement Fulfillment:
    • Real-time Interactions: Uses NoSQL databases and in-memory data stores to manage high-speed read/write operations.
    • Data Consistency: Ensures accurate recording of interactions through coordinated updates and eventual consistency models.
    • Engagement Tracking: Monitors user interactions to enhance personalized recommendations.
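One way an in-memory layer and an eventually consistent store might work together is to buffer like increments in memory and write aggregated deltas in batches. The sketch below is an assumption about how such a buffer could look; `LikeCounterBuffer`, its threshold, and the plain dict standing in for the NoSQL table are all hypothetical.

```python
from collections import Counter

class LikeCounterBuffer:
    """Accumulates like deltas in memory and flushes them in batches."""

    def __init__(self, durable_store, flush_threshold=3):
        self.durable_store = durable_store  # dict standing in for the NoSQL table
        self.flush_threshold = flush_threshold
        self.pending = Counter()
        self.pending_events = 0

    def record_like(self, video_id):
        self.pending[video_id] += 1
        self.pending_events += 1
        if self.pending_events >= self.flush_threshold:
            self.flush()

    def flush(self):
        # One write per video instead of one write per like.
        for video_id, delta in self.pending.items():
            self.durable_store[video_id] = self.durable_store.get(video_id, 0) + delta
        self.pending.clear()
        self.pending_events = 0

    def read_count(self, video_id):
        """Fresh view: durable count plus deltas not yet flushed."""
        return self.durable_store.get(video_id, 0) + self.pending[video_id]

store = {}
buffer = LikeCounterBuffer(store, flush_threshold=3)
buffer.record_like("v1")
buffer.record_like("v1")
stale = store.get("v1", 0)       # the durable store has not caught up yet
fresh = buffer.read_count("v1")  # the buffer already reflects both likes
buffer.record_like("v2")         # third event triggers a flush
print(stale, fresh, store)
```

Readers that go straight to the durable store briefly see a stale count, which is the trade-off eventual consistency accepts in exchange for far fewer writes.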

4. Analytics Service

  • Functionality: Collects and analyzes user engagement data to refine recommendation algorithms and monitor content performance.
  • Interactions:
    • Message Queues: Receives data from the User and Content Services.
    • Data Warehouses: Aggregates and stores large volumes of interaction data.
    • Big Data Processing: Processes data to generate insights and optimize recommendations.
  • Requirement Fulfillment:
    • User Engagement Tracking: Analyzes how users interact with content to improve recommendation algorithms.
    • Content Performance Analysis: Evaluates video performance metrics to inform content moderation and feature enhancements.
    • Real-time Analytics: Provides timely insights to adjust personalized feeds dynamically.

Database Types

TikTok's diverse data and access patterns necessitate the use of multiple database types, each optimized for specific use cases.

1. Relational Databases (SQL)

Use Cases:

  • User Management: Storing user profiles, authentication details, and follow relationships.
  • Transactional Operations: Ensuring data consistency for critical operations like user registrations and profile updates.

Examples:

  • PostgreSQL: Known for its robustness and advanced features.
  • MySQL: Widely used for its reliability and performance.

2. NoSQL Databases

Use Cases:

  • Media Metadata: Handling metadata for large volumes of videos with high write and read throughput (the video files themselves live in object storage).
  • Interactions: Managing real-time interactions such as likes, comments, shares, and views.

Examples:

  • Cassandra: Suitable for handling high-velocity data with excellent write performance.
  • MongoDB: Flexible schema design, useful for storing diverse media metadata.

3. Object Storage

Use Cases:

  • Media Storage: Efficiently storing and retrieving large media files like videos.
  • Scalability: Managing massive amounts of unstructured data with ease.

Examples:

  • Amazon S3: Highly durable and scalable object storage service.
  • Google Cloud Storage: Offers similar features with integration into Google's ecosystem.

Database Schema

Designing an effective database schema is crucial for ensuring data integrity, efficient access, and scalability. For TikTok, leveraging both relational and NoSQL databases allows optimization of different aspects of the platform based on their unique requirements and access patterns. Below, we explore the schema designs tailored to TikTok to fulfill the system requirements.

1. Relational Schema

Relational databases are ideal for structured data with well-defined relationships, such as user profiles, follow relationships, and content categorization.

Users Table

| Column Name | Data Type | Description |
|---|---|---|
| user_id (PK) | BIGINT | Unique identifier for each user |
| username | VARCHAR | Unique username |
| email | VARCHAR | User's email address |
| password_hash | VARCHAR | Hashed password for security |
| display_name | VARCHAR | User's display name |
| bio | TEXT | User's biography |
| creation_time | TIMESTAMP | Account creation timestamp |

Followers Table

| Column Name | Data Type | Description |
|---|---|---|
| follower_id (PK) | BIGINT | Unique identifier for the follower relationship |
| user_id (FK) | BIGINT | ID of the user being followed |
| follower_user_id (FK) | BIGINT | ID of the follower user |
| creation_time | TIMESTAMP | Timestamp when the follow occurred |

Categories Table

| Column Name | Data Type | Description |
|---|---|---|
| category_id (PK) | INT | Unique identifier for category |
| name | VARCHAR | Name of the category (e.g., Dance, Comedy) |
| description | TEXT | Description of the category |
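To see the Users and Followers tables working together, the sketch below builds them in an in-memory SQLite database using Python's standard `sqlite3` module. SQLite is only a convenient stand-in for PostgreSQL or MySQL here, and the `UNIQUE (user_id, follower_user_id)` constraint and the `follow` helper are assumptions added for illustration; they are not part of the schema above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE users (
    user_id       INTEGER PRIMARY KEY,
    username      TEXT NOT NULL UNIQUE,
    email         TEXT NOT NULL,
    password_hash TEXT NOT NULL,
    display_name  TEXT,
    bio           TEXT,
    creation_time TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE followers (
    follower_id      INTEGER PRIMARY KEY,
    user_id          INTEGER NOT NULL REFERENCES users(user_id),
    follower_user_id INTEGER NOT NULL REFERENCES users(user_id),
    creation_time    TEXT DEFAULT CURRENT_TIMESTAMP,
    UNIQUE (user_id, follower_user_id)  -- assumption: one row per follow pair
);
""")

conn.executemany(
    "INSERT INTO users (user_id, username, email, password_hash) VALUES (?, ?, ?, ?)",
    [
        (1, "alice", "alice@example.com", "hash1"),
        (2, "bob", "bob@example.com", "hash2"),
        (3, "carol", "carol@example.com", "hash3"),
    ],
)

def follow(follower, followee):
    """Record that `follower` follows `followee`; repeating a follow is a no-op."""
    conn.execute(
        "INSERT OR IGNORE INTO followers (user_id, follower_user_id) VALUES (?, ?)",
        (followee, follower),
    )

follow(2, 1)
follow(3, 1)
follow(2, 1)  # duplicate, ignored because of the UNIQUE constraint

follower_names = [
    row[0]
    for row in conn.execute(
        "SELECT u.username FROM followers f "
        "JOIN users u ON u.user_id = f.follower_user_id "
        "WHERE f.user_id = ? ORDER BY u.username",
        (1,),
    )
]
print(follower_names)
```

The join from `followers` back to `users` is the kind of relationship query that motivates keeping follow data in a relational store.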

2. NoSQL Schema (Using Cassandra)

NoSQL databases like Cassandra are suitable for handling TikTok's high-volume, time-series data such as videos, interactions, and user activity logs.

Videos Table

CREATE TABLE videos (
    video_id UUID PRIMARY KEY,
    user_id BIGINT,
    video_url TEXT,
    thumbnail_url TEXT,
    category_id INT,
    description TEXT,
    upload_time TIMESTAMP,
    views_count BIGINT,
    likes_count BIGINT,
    comments_count BIGINT,
    shares_count BIGINT,
    expiration_time TIMESTAMP
);

Interactions Table

CREATE TABLE interactions (
    interaction_id UUID PRIMARY KEY,
    video_id UUID,
    user_id BIGINT,
    interaction_type TEXT,        -- like, comment, share (CQL has no ENUM type)
    interaction_time TIMESTAMP,
    comment_text TEXT,            -- null unless the interaction is a comment
    parent_interaction_id UUID    -- null unless the comment is a reply
);

UserActivity Table

CREATE TABLE user_activity (
    user_id BIGINT,
    activity_type TEXT,           -- login, upload, like, comment, share (CQL has no ENUM type)
    activity_time TIMESTAMP,
    details TEXT,
    PRIMARY KEY (user_id, activity_time)
) WITH CLUSTERING ORDER BY (activity_time DESC);
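In the user_activity table, `user_id` is the partition key and `activity_time` is the clustering column, so each user's rows are stored together and ordered newest first. The following is a rough in-memory model of that layout, not Cassandra client code; integer timestamps and the helper names are assumptions made to keep the sketch short.

```python
from bisect import insort
from collections import defaultdict

# partition key (user_id) -> rows kept sorted by clustering column.
# Times are stored negated so ascending order means newest first.
partitions = defaultdict(list)

def record_activity(user_id, activity_time, activity_type):
    insort(partitions[user_id], (-activity_time, activity_type))

def latest_activities(user_id, limit):
    """Read the newest rows of one partition, like a LIMIT query on one user_id."""
    return [(-neg_time, kind) for neg_time, kind in partitions[user_id][:limit]]

record_activity(42, 100, "login")
record_activity(42, 300, "like")
record_activity(42, 200, "upload")
record_activity(7, 150, "comment")

recent = latest_activities(42, 2)
print(recent)
```

Because the rows are already ordered inside the partition, fetching a user's most recent activity touches only the front of one partition rather than scanning or sorting at read time.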

Considerations

  • Normalization vs. Denormalization:

    • Relational Schema: Emphasizes normalization to reduce redundancy, ensuring data integrity for user-related information.
    • NoSQL Schema: Embraces denormalization to optimize read performance for high-volume data like videos and interactions.
  • Indexing:

    • Relational Databases: Indexes on primary and foreign keys to speed up joins and lookups.
    • NoSQL Databases: Secondary indexes on frequently queried fields (e.g., video_id in Interactions) to enhance retrieval efficiency; in Cassandra, the hottest queries are usually better served by a table partitioned on that field.
  • Scalability:

    • NoSQL Databases: Designed for horizontal scaling, allowing TikTok to handle growing data volumes seamlessly.
  • Data Consistency:

    • Relational Databases: Ensure strong consistency for user and follow data through ACID transactions.
    • NoSQL Databases: Utilize eventual consistency for high availability and performance in handling videos and interactions.

Recommendation Algorithms

Personalized recommendations are at the core of TikTok's user engagement strategy. Efficiently storing and processing data to support these algorithms is crucial.

  • Feature Storage: Store user interaction data and video metadata to feed into recommendation models.
  • Model Serving: Deploy machine learning models that can quickly access relevant data to generate real-time recommendations.

Implementation:

  • Vector Search: Use a vector similarity library such as FAISS or a vector database such as Milvus to store and query the high-dimensional feature vectors used in recommendation algorithms.
  • Model Training Pipelines: Implement data pipelines that continuously train and update recommendation models based on incoming interaction data.
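The core query behind vector-based recommendation is "which video vectors are closest to this user's vector?". The sketch below answers it with a brute-force cosine-similarity scan in plain Python. It is a stand-in for illustration only: it does not use the FAISS or Milvus APIs, which rely on approximate nearest-neighbor indexes to avoid scanning every vector, and the three-dimensional vectors and video names are made up.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_videos(user_vector, video_vectors, k):
    """Return the ids of the k videos most similar to the user vector."""
    scored = sorted(
        video_vectors.items(),
        key=lambda item: cosine_similarity(user_vector, item[1]),
        reverse=True,
    )
    return [video_id for video_id, _ in scored[:k]]

# Hypothetical feature vectors; real embeddings have hundreds of dimensions.
video_vectors = {
    "dance_clip": [0.9, 0.1, 0.0],
    "comedy_skit": [0.1, 0.9, 0.1],
    "cooking_demo": [0.0, 0.2, 0.9],
}
user_vector = [0.8, 0.3, 0.0]

recommended = top_k_videos(user_vector, video_vectors, k=2)
print(recommended)
```

A full scan like this costs time proportional to the catalog size for every request, which is the reason a dedicated vector index is used at TikTok's scale.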
