Database Fundamentals

Quora System Design & Database Design

Let's prepare a database design for a platform like Quora.

1. What is Quora?

Quora is a widely recognized question-and-answer (Q&A) platform where users can ask questions, provide answers, and engage in discussions on a wide range of topics. It was established to share and grow the world's knowledge.

Given its extensive user base and the dynamic nature of content creation and consumption, Quora requires a robust and scalable database system to manage data efficiently.

2. Requirements and Goals of the System

Functional Requirements

  • Users should be able to post new questions.
  • Users should be able to write and post answers to questions.
  • Users should be able to follow specific topics and other users.
  • Users should be able to search for questions and topics.

Non-functional Requirements

  • High availability: The platform should remain accessible most of the time.
  • Low latency: The system should respond quickly, ideally within a few hundred milliseconds.
  • Scalability: The system should handle growing numbers of users and content.
  • Eventual consistency is acceptable in some areas (e.g., vote counts may not update immediately).

3. Storage Capacity Estimation

Estimating Quora's storage needs ensures the system can handle current demands and future growth efficiently.

  • Users: With 100 million active users at approximately 2 KB each, user data totals around 200 GB.
  • Questions: Assuming 10 million new questions annually at 1 KB each requires 10 GB/year.
  • Answers: Averaging 5 answers per question gives 50 million answers per year; at 2 KB each, this amounts to 100 GB/year.
  • Votes and Interactions: Handling 500 million votes annually at 0.5 KB each leads to 250 GB/year.
  • Media Content: If 20% of answers include media, resulting in 10 million media attachments at 500 KB each, this adds 5 TB/year.

Total Initial Storage: 0.2 TB (Users) + 5.36 TB (Annual Growth) ≈ 5.56 TB

Media dominates the raw total. To accommodate data replication, backups, and future expansion, provision a multiple of this figure: for example, three replicas and five years of growth at this rate come to roughly 80 TB.
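The arithmetic behind these estimates can be checked with a short script. This is a minimal sketch that uses only the counts and per-item sizes listed above; the decimal units (1 KB = 1,000 bytes) and the helper names `total_bytes` and `projected_tb` are assumptions made for illustration.

```python
# Back-of-the-envelope storage estimate from the counts and sizes listed above.
# Assumption: decimal units (1 KB = 1,000 bytes, 1 TB = 10**12 bytes).
KB = 1_000
TB = 10**12


def total_bytes(count: int, size_kb: float) -> float:
    """Storage needed for `count` items of `size_kb` kilobytes each."""
    return count * size_kb * KB


# One-time cost: existing user records.
users = total_bytes(100_000_000, 2)

# Recurring cost: new content created each year.
annual = {
    "questions": total_bytes(10_000_000, 1),
    "answers": total_bytes(50_000_000, 2),
    "votes": total_bytes(500_000_000, 0.5),
    "media": total_bytes(10_000_000, 500),
}
annual_total = sum(annual.values())


def projected_tb(years: int, replicas: int = 1) -> float:
    """Storage in TB after `years` of growth, multiplied by `replicas` copies."""
    return replicas * (users + years * annual_total) / TB


if __name__ == "__main__":
    print(f"users: {users / TB:.2f} TB")
    for name, size in annual.items():
        print(f"{name}: {size / TB:.2f} TB/year")
    print(f"first year, single copy: {projected_tb(1):.2f} TB")
    print(f"five years, three copies: {projected_tb(5, replicas=3):.2f} TB")
```

Changing the replica count or the planning horizon in `projected_tb` shows how quickly the provisioning target moves, which is why the estimate is best treated as a range rather than a single number.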

4. High Level System Design

Designing a high-level architecture for Quora involves mapping out the core components and their interactions to meet the system's requirements.

Key Components

  1. Clients

    • Web Browsers: Users access Quora via desktop and mobile web browsers.
    • Mobile Apps: Native applications for iOS and Android devices.
  2. Load Balancers

    • Purpose: Distribute incoming traffic evenly across multiple instances of each microservice to prevent any single server from becoming a bottleneck.
    • Examples: NGINX, HAProxy, AWS Elastic Load Balancer.
  3. API Gateway

    • Purpose: Acts as a single entry point for all client requests, handling routing and authentication.
    • Examples: Kong, AWS API Gateway, Zuul.
  4. Microservices

    • User Service
      • Responsibilities: Manage user profiles, authentication, and authorization.
    • Question-Answer Service
      • Responsibilities: Handle creation, retrieval, and management of questions and answers.
    • Search Service
      • Responsibilities: Provide full-text search capabilities across questions and answers.
  5. Database Cluster

    • Relational Database: Stores structured data such as user profiles and relationships.
    • NoSQL Database: Manages unstructured data like questions and answers.
    • Search Engine: Indexes content to facilitate efficient search queries.
  6. Message Queues

    • Purpose: Handle asynchronous tasks such as indexing new content and sending notifications.
    • Examples: Apache Kafka, RabbitMQ.
  7. File Storage

    • Purpose: Store media content (images, videos) associated with questions and answers.
    • Examples: Amazon S3, Hadoop Distributed File System (HDFS).

Microservices Architecture

Adopting a microservices architecture allows Quora to scale each service independently and maintain a clear separation of concerns. Below is an overview of the three main microservices and their interactions.

1. User Service

  • Functionality:
    • Handles user registration, login, profile management, and authentication.
  • Interactions:
    • API Gateway: Receives authentication and user-related requests.
    • Database: Stores user information and relationships.

2. Question-Answer Service

  • Functionality:
    • Manages the lifecycle of questions and answers, including creation, editing, and retrieval.
  • Interactions:
    • API Gateway: Receives requests to post or fetch questions and answers.
    • Database: Stores questions and answers for high scalability and quick access.
    • File Storage: Handles media uploads associated with questions and answers.

3. Search Service

  • Functionality:
    • Provides full-text search capabilities across all questions and answers.
  • Interactions:
    • Message Queue: Consumes events from the Question-Answer Service to index new content.
    • Elasticsearch Cluster: Processes and returns search results based on user queries.
    • API Gateway: Receives search queries from clients and routes them to the Search Service.

Data Flow Overview

Figure: Quora High-level System Design
  1. User Interaction:

    • A user interacts with Quora via a web browser or mobile app.
    • The request is sent to the Load Balancer, which directs it to the API Gateway.
  2. Request Routing:

    • The API Gateway authenticates the request and routes it to the appropriate Microservice (User Service, Question-Answer Service, or Search Service).
  3. Service Processing:

    • User Service: Processes user-related requests, interacts with the relational database, and publishes events to the message queue.
    • Question-Answer Service: Handles Q&A operations, interacts with the NoSQL database and file storage, and publishes events for indexing.
    • Search Service: Processes search queries by querying the Elasticsearch cluster and returning relevant results to the client.
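The asynchronous part of this flow (the Question-Answer Service publishes an event, the Search Service consumes it and indexes the content) can be sketched in a few lines. This is an illustrative, single-process model only: Python's `queue.Queue` stands in for Kafka or RabbitMQ, a dictionary-based inverted index stands in for Elasticsearch, and every function name here is hypothetical.

```python
import queue
from collections import defaultdict

# Stand-in for the message queue (Kafka or RabbitMQ in the text).
events = queue.Queue()

# Stand-ins for the NoSQL content store and the search index.
questions = {}
inverted_index = defaultdict(set)


def post_question(question_id, title):
    """Question-Answer Service: persist the question, then publish an event."""
    questions[question_id] = {"title": title}
    events.put({"type": "question_created", "question_id": question_id})


def run_indexer():
    """Search Service consumer: drain pending events and index each question."""
    while not events.empty():
        event = events.get()
        title = questions[event["question_id"]]["title"]
        for word in title.lower().split():
            inverted_index[word.strip("?.,!")].add(event["question_id"])


def search(term):
    """Search Service query path: look the term up in the index."""
    return inverted_index.get(term.lower(), set())


if __name__ == "__main__":
    post_question(1, "How does Quora handle database scaling?")
    print(search("scaling"))  # not indexed yet: the event is still queued
    run_indexer()
    print(search("scaling"))  # indexed after the consumer has run
```

The gap between posting and indexing is the eventual consistency accepted in the non-functional requirements: a new question is stored immediately but becomes searchable only after the consumer has processed its event.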

5. Database Types

Quora's diverse data and access patterns necessitate the use of multiple database types, each optimized for specific use cases.

1. Relational Databases (SQL)

Use Cases:

  • User Management: Storing user profiles, authentication details, and relationships.
  • Transactional Operations: Ensuring data consistency for critical operations like updates to user information.

Examples:

  • PostgreSQL: Known for its robustness and advanced features.
  • MySQL: Widely used for its reliability and performance.

2. NoSQL Databases

Use Cases:

  • Content Storage: Handling large volumes of questions, answers, and interactions.
  • Scalability: Managing high read and write throughput.

Examples:

  • Cassandra: Suitable for handling large-scale, high-velocity data with excellent write performance.
  • MongoDB: Flexible schema design, useful for storing diverse content types.

3. Search Engine Databases

Use Cases:

  • Full-Text Search: Enabling efficient and relevant search capabilities across questions and answers.
  • Content Discovery: Facilitating features like trending topics and related questions.

Examples:

  • Elasticsearch: Highly scalable and offers real-time search capabilities.
  • Solr: Provides robust search features with extensive customization options.

6. Database Schema

Designing an effective database schema is pivotal for ensuring data integrity, efficient access, and scalability. For Quora, leveraging both relational and NoSQL databases allows us to optimize different aspects of the platform based on their unique requirements and access patterns. Below, we explore the schema designs tailored to each database type.

1. Relational Schema

Relational databases are ideal for structured data with well-defined relationships, such as user profiles and their interactions. Using a relational database ensures data consistency and supports complex queries essential for user management and relationships.

Figure: Database Schema for Quora

Users Table

| Column Name | Data Type | Description |
|---|---|---|
| user_id (PK) | BIGINT | Unique identifier for each user |
| name | VARCHAR | User's full name |
| email | VARCHAR | User's email address |
| password_hash | VARCHAR | Hashed password for security |
| bio | TEXT | User's biography |
| creation_time | TIMESTAMP | Account creation timestamp |

Followers Table

| Column Name | Data Type | Description |
|---|---|---|
| follower_id (PK) | BIGINT | Unique identifier for the follower record |
| user_id (FK) | BIGINT | ID of the user being followed |
| follower_user_id (FK) | BIGINT | ID of the user who is following |
| creation_time | TIMESTAMP | Timestamp when the follow occurred |

User_Follow_Topic Table

| Column Name | Data Type | Description |
|---|---|---|
| follow_id (PK) | BIGINT | Unique identifier for the follow record |
| user_id (FK) | BIGINT | ID of the user who is following |
| topic_id (FK) | BIGINT | ID of the topic being followed |
| creation_time | TIMESTAMP | Timestamp when the follow occurred |
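The three tables above can be tried out with a small script. The sketch below uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL or MySQL, so the column types are SQLite's rather than BIGINT, VARCHAR, and TIMESTAMP. The UNIQUE constraints and the `open_db` and `follower_names` helpers are assumptions added for illustration, and `topic_id` is left without a foreign key because no Topics table is defined in this lesson.

```python
import sqlite3

SCHEMA = """
CREATE TABLE users (
    user_id       INTEGER PRIMARY KEY,
    name          TEXT NOT NULL,
    email         TEXT NOT NULL UNIQUE,
    password_hash TEXT NOT NULL,
    bio           TEXT,
    creation_time TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE followers (
    follower_id      INTEGER PRIMARY KEY,
    user_id          INTEGER NOT NULL REFERENCES users(user_id),
    follower_user_id INTEGER NOT NULL REFERENCES users(user_id),
    creation_time    TEXT DEFAULT CURRENT_TIMESTAMP,
    UNIQUE (user_id, follower_user_id)  -- assumption: one follow per pair
);
CREATE TABLE user_follow_topic (
    follow_id     INTEGER PRIMARY KEY,
    user_id       INTEGER NOT NULL REFERENCES users(user_id),
    topic_id      INTEGER NOT NULL,     -- no Topics table in this lesson
    creation_time TEXT DEFAULT CURRENT_TIMESTAMP,
    UNIQUE (user_id, topic_id)
);
"""


def open_db():
    """Create an in-memory database with the schema and foreign keys enforced."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def follower_names(conn, user_id):
    """Names of the users who follow `user_id`, in alphabetical order."""
    rows = conn.execute(
        """SELECT u.name
           FROM followers f
           JOIN users u ON u.user_id = f.follower_user_id
           WHERE f.user_id = ?
           ORDER BY u.name""",
        (user_id,),
    ).fetchall()
    return [name for (name,) in rows]
```

The join in `follower_names` is the kind of relationship query the relational store is chosen for: the follow edge is stored once, and either side of it can be resolved to a user row.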

2. NoSQL Schema (Using a Document Store like MongoDB)

NoSQL databases offer flexibility in handling diverse and large-scale data, making them suitable for storing Quora's vast and dynamic content such as questions, answers, and interactions. By using a document-oriented approach, we can optimize for rapid read and write operations, essential for user engagement.

Figure: NoSQL Database Schema

Questions Collection

Each document represents a question along with its associated answers. Embedding answers within the question document can optimize read performance when fetching a question and its answers together.

{
  "question_id": ObjectId("60d5f483f8d2e45d7c8b4567"),
  "title": "How does Quora handle database scaling?",
  "body": "Detailed explanation of database scaling strategies...",
  "user_id": ObjectId("60d5f483f8d2e45d7c8b1234"),
  "creation_time": ISODate("2024-01-01T00:00:00Z"),
  "topics": ["Database Scaling", "System Design"],
  "answers": [
    {
      "answer_id": ObjectId("60d5f483f8d2e45d7c8b8901"),
      "user_id": ObjectId("60d5f483f8d2e45d7c8b2345"),
      "body": "Quora employs a combination of relational and NoSQL databases to manage different data types efficiently...",
      "vote_count": 150,
      "creation_time": ISODate("2024-01-01T01:00:00Z"),
      "comments": [
        {
          "comment_id": ObjectId("60d5f483f8d2e45d7c8b3456"),
          "user_id": ObjectId("60d5f483f8d2e45d7c8b4567"),
          "body": "Great explanation!",
          "creation_time": ISODate("2024-01-01T02:00:00Z")
        }
      ]
    },
    // More answers...
  ]
}

Users Collection

While user profiles are primarily managed in the relational database, certain user activities can be mirrored in the NoSQL database to optimize performance for specific queries.

{
  "user_id": ObjectId("60d5f483f8d2e45d7c8b1234"),
  "name": "Jane Doe",
  "email": "jane.doe@example.com",
  "bio": "Enthusiastic learner and educator.",
  "creation_time": ISODate("2023-01-01T00:00:00Z"),
  "followers_count": 2500,
  "following_count": 300,
  "favorite_answers": [
    ObjectId("60d5f483f8d2e45d7c8b8901"),
    ObjectId("60d5f483f8d2e45d7c8b8902")
  ],
  "followed_topics": ["Database Scaling", "Artificial Intelligence"]
}

Considerations for NoSQL Schema

  • Denormalization: Embedding related data (like answers within questions) reduces the need for complex joins, enhancing read performance. However, it can lead to data redundancy.

  • Scalability: Document stores like MongoDB are designed to scale horizontally, handling large volumes of data with ease.

  • Flexibility: The schema can evolve over time without requiring extensive migrations, accommodating new features and data types seamlessly.
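The denormalization trade-off can be made concrete by comparing the embedded layout with a referenced one. The sketch below uses plain Python dictionaries in place of MongoDB collections, so no driver is required; the collection names, IDs, and helper functions are all hypothetical.

```python
# Plain dictionaries stand in for MongoDB collections (no driver required).

# Embedded layout: answers live inside the question document.
questions_embedded = {
    "q1": {
        "title": "How does Quora handle database scaling?",
        "answers": [
            {"answer_id": "a1", "body": "Sharding and replication.", "vote_count": 150},
            {"answer_id": "a2", "body": "Caching helps too.", "vote_count": 20},
        ],
    }
}

# Referenced layout: answers live in their own collection and point at a question.
questions_ref = {"q1": {"title": "How does Quora handle database scaling?"}}
answers_ref = {
    "a1": {"question_id": "q1", "body": "Sharding and replication.", "vote_count": 150},
    "a2": {"question_id": "q1", "body": "Caching helps too.", "vote_count": 20},
}


def read_embedded(question_id):
    """One lookup returns the question together with all of its answers."""
    return questions_embedded[question_id]


def read_referenced(question_id):
    """Two steps: fetch the question, then collect the answers that reference it."""
    question = dict(questions_ref[question_id])
    question["answers"] = [
        {"answer_id": answer_id, **answer}
        for answer_id, answer in answers_ref.items()
        if answer["question_id"] == question_id
    ]
    return question


def upvote_embedded(question_id, answer_id):
    """Changing one answer means modifying the larger question document."""
    for answer in questions_embedded[question_id]["answers"]:
        if answer["answer_id"] == answer_id:
            answer["vote_count"] += 1
            return answer["vote_count"]
    raise KeyError(answer_id)
```

Embedding wins when a question and its answers are almost always read together. MongoDB caps a single document at 16 MB, so a question that accumulates a very large number of answers and comments may eventually need the referenced layout, at the cost of the extra lookup shown in `read_referenced`.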

7. Sharding and Partitioning

As Quora scales, distributing data across multiple servers becomes essential to maintain performance and manageability. Sharding and partitioning strategies ensure that the database can handle high traffic and large datasets efficiently.

What is Sharding?

Sharding is the process of dividing a large database into smaller, more manageable pieces called shards. Each shard holds a subset of the data, allowing the system to distribute load and storage across multiple machines.

Sharding Strategies

  1. Sharding by UserID:

    Approach:

    • Distribute data based on the user's unique identifier.
    • All data related to a specific user (questions, answers, votes) reside on the same shard.

    Advantages:

    • Simplifies data retrieval for user-specific queries.
    • Localizes user interactions, reducing cross-shard operations.

    Challenges:

    • Hotspots: Popular users may generate disproportionate traffic, leading to performance bottlenecks.
    • Uneven Data Distribution: Users with extensive activity can consume more resources, making it difficult to maintain uniform shard sizes.
  2. Sharding by QuestionID:

    Approach:

    • Distribute data based on the unique identifier of each question.
    • All answers and interactions related to a question are stored on the same shard.

    Advantages:

    • Balances load more effectively as questions are generally accessed uniformly.
    • Reduces contention for popular questions.

    Challenges:

    • Complex Timeline Generation: Aggregating answers from multiple shards when generating user timelines can increase latency.
    • Cross-Shard Queries: Fetching related data may require querying multiple shards, complicating the retrieval process.
  3. Composite Sharding (Combining Multiple Strategies):

    Approach:

    • Combine sharding by UserID and QuestionID or incorporate hash-based techniques to distribute data more evenly.
    • For example, use a hash of UserID to determine the shard, then further distribute based on QuestionID within the shard.

    Advantages:

    • Balances the benefits of different sharding strategies.
    • Mitigates the drawbacks associated with single-strategy sharding.

    Challenges:

    • Increased Complexity: Managing composite sharding rules can complicate the system architecture.
    • Potential for Cross-Shard Operations: May still require cross-shard queries for certain operations.
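The composite example above (hash the UserID to pick a shard, then distribute by QuestionID within it) can be sketched as a small routing function. This is one possible reading of that strategy, not a prescribed implementation: the shard and partition counts are made-up defaults, and SHA-256 is used only because Python's built-in `hash()` is not stable across processes for strings.

```python
import hashlib


def stable_hash(key: str) -> int:
    """Hash that gives the same value in every process and on every machine."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def composite_location(
    user_id: str,
    question_id: str,
    num_shards: int = 8,            # hypothetical cluster size
    partitions_per_shard: int = 4,  # hypothetical partition count
) -> tuple[int, int]:
    """Return (shard, partition) for a piece of content.

    The shard is chosen by the author's UserID, so all of a user's content
    lands on one shard; the QuestionID spreads that content across the
    partitions inside the shard.
    """
    shard = stable_hash(user_id) % num_shards
    partition = stable_hash(question_id) % partitions_per_shard
    return shard, partition


if __name__ == "__main__":
    for question_id in ("question-1", "question-2", "question-3"):
        print(question_id, composite_location("user-42", question_id))
```

Because the shard depends only on the UserID, a user-centric query touches a single shard, while reading every answer to one question may still fan out across shards, which is the cross-shard cost listed under the challenges.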

Recommended Sharding Strategy for Quora

Given Quora's requirements for both high read and write throughput, a composite sharding strategy is advisable. Here's how it can be implemented:

  1. Primary Shard Key: QuestionID Hash:

    • Use a hash of the QuestionID to distribute questions and their associated answers across shards evenly.
    • This approach reduces hotspots: hashing spreads questions evenly across shards, although a single very popular question can still concentrate load on the shard that holds it.
  2. Secondary Shard Key: UserID Hash (for User Data):

    • Store user profiles, follow relationships, and other user-specific data on shards determined by hashing the UserID.
    • This ensures that user-related operations remain efficient and isolated from content-specific shards.
  3. Shard Management:

    • Implement Consistent Hashing to minimize data movement when adding or removing shards.
    • Use a Shard Manager service that maintains metadata about shard distribution and handles data routing.
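Consistent hashing, mentioned under shard management, can be illustrated with a small hash ring. The sketch below is a generic textbook version with virtual nodes, not Quora's actual implementation; the `HashRing` class, the shard names, and the choice of 100 virtual nodes per shard are assumptions.

```python
import bisect
import hashlib


def stable_hash(key: str) -> int:
    """Position of a key on the ring (stable across processes)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    """Consistent-hash ring with virtual nodes; shard names must be strings."""

    def __init__(self, shards, vnodes=100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (position, shard) pairs
        for shard in shards:
            self.add_shard(shard)

    def add_shard(self, shard):
        """Place `vnodes` points for the shard on the ring."""
        for i in range(self.vnodes):
            bisect.insort(self._ring, (stable_hash(f"{shard}#{i}"), shard))

    def remove_shard(self, shard):
        """Drop every point that belongs to the shard."""
        self._ring = [(pos, s) for pos, s in self._ring if s != shard]

    def shard_for(self, key):
        """Return the owner of the first ring point at or after the key."""
        if not self._ring:
            raise LookupError("ring is empty")
        idx = bisect.bisect_right(self._ring, (stable_hash(key), ""))
        return self._ring[idx % len(self._ring)][1]  # wrap around at the end


if __name__ == "__main__":
    ring = HashRing(["shard-a", "shard-b", "shard-c"])
    keys = [f"question-{i}" for i in range(1000)]
    before = {k: ring.shard_for(k) for k in keys}
    ring.add_shard("shard-d")
    moved = [k for k in keys if ring.shard_for(k) != before[k]]
    print(f"{len(moved)} of {len(keys)} keys moved after adding a shard")
```

When a shard is added, the only keys that change owner are the ones the new shard takes over; everything else stays where it was. With plain `hash(key) % num_shards` routing, changing the shard count would remap most keys instead.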

8. Replication and Fault Tolerance

Ensuring data availability and reliability is paramount for Quora's database system. Replication and fault tolerance mechanisms safeguard against data loss and system downtime.

A master-slave replication approach is suitable for Quora, complemented by multi-master replication for specific services requiring high write availability.

  1. Relational Databases:
    • Use master-slave replication, where the master handles all write operations and slaves manage read requests.
    • Implement automatic failover mechanisms to promote a slave to master in case the master fails.
Figure: Master Slave Replication
  2. NoSQL Databases:
    • Leverage multi-master replication capabilities inherent in databases like Cassandra, allowing any node to handle read and write operations.
    • Utilize data centers spread across different geographic regions to enhance availability and reduce latency.

Fault Tolerance Mechanisms

  1. Data Redundancy:

    • Maintain multiple replicas of data across different shards and data centers to prevent data loss.
  2. Automated Failover:

    • Implement systems that detect failures and automatically reroute traffic or promote replicas without manual intervention.
  3. Backup and Recovery:

    • Regularly back up data to secure storage solutions.
    • Establish robust recovery procedures to restore data and services swiftly in case of catastrophic failures.
  4. Load Balancing:

    • Distribute traffic evenly across replicas to prevent any single node from becoming overwhelmed.
    • Use health checks to route traffic away from unhealthy or failed nodes.
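How read/write splitting, health checks, and automated failover fit together can be sketched with a toy router. This is a simplified, single-process model under stated assumptions: "primary" and "replica" correspond to the master and slave roles described above, the node names are invented, and a real system would also have to handle replication lag and split-brain, which this sketch ignores.

```python
import itertools


class ReplicatedCluster:
    """Toy router: writes go to the primary, reads rotate over healthy replicas."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = list(replicas)
        self.healthy = {primary, *replicas}
        self._counter = itertools.count()  # drives round-robin read routing

    def mark_down(self, node):
        """Record a failed health check; fail over if the primary is affected."""
        self.healthy.discard(node)
        if node == self.primary:
            self._failover()

    def _failover(self):
        """Promote the first healthy replica to primary."""
        candidates = [r for r in self.replicas if r in self.healthy]
        if not candidates:
            raise RuntimeError("no healthy replica to promote")
        self.primary = candidates[0]
        self.replicas.remove(self.primary)

    def node_for_write(self):
        """All writes are sent to the current primary."""
        return self.primary

    def node_for_read(self):
        """Round-robin over healthy replicas; fall back to the primary if none."""
        candidates = [r for r in self.replicas if r in self.healthy]
        if not candidates:
            return self.primary
        return candidates[next(self._counter) % len(candidates)]


if __name__ == "__main__":
    cluster = ReplicatedCluster("db-1", ["db-2", "db-3"])
    print(cluster.node_for_write(), cluster.node_for_read())
    cluster.mark_down("db-1")  # simulated primary failure
    print(cluster.node_for_write(), cluster.node_for_read())
```

Promoting the first healthy replica keeps the example short; production failover typically prefers the replica that has applied the most recent writes, so that as little data as possible is lost.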

Conclusion

Designing Quora's database system involves a strategic blend of various database technologies, sharding and partitioning strategies, and robust replication mechanisms to ensure scalability, reliability, and high performance. By understanding the platform's requirements and meticulously planning each component, we can create a resilient and efficient system capable of supporting millions of users and their interactions seamlessly.
