Mastering SQL: LeetCode 2199 - Finding the Topic of Each Post

In today's data-driven world, the ability to extract meaningful information from text data is crucial. LeetCode's problem 2199, "Finding the Topic of Each Post", presents an excellent opportunity to hone your SQL skills in this area. Let's dive into this problem and explore different approaches to solve it.

The Problem

We're given two tables: Keywords and Posts. The task is to find the topics of each post based on the keywords present in its content. Here are the key points:

The Solution

Let's break down the solution into steps:

Here's a PostgreSQL solution using unnest, also pay special attention to the use of the ',' and the ' ':

SELECT 

    p.post_id,

    COALESCE(

        STRING_AGG(DISTINCT CAST(k.topic_id AS VARCHAR), ',' ORDER BY CAST(k.topic_id AS VARCHAR)),

        'Ambiguous!'

    ) AS topic

FROM 

    Posts p

LEFT JOIN LATERAL

    unnest(string_to_array(lower(p.content), ' ')) AS words(word) ON true

LEFT JOIN 

    Keywords k ON lower(k.word) = words.word

GROUP BY 

    p.post_id;

This solution uses unnest and string_to_array to split the content into words. It's efficient and readable, making it a good choice for PostgreSQL environments.

Alternative Approaches

Using RECURSIVE CTE

For versions of databases that don't support unnest or string_to_array, we can use a RECURSIVE CTE, however emphasize here on the performance hit that you take withthis approach:

WITH RECURSIVE 

words AS (

    SELECT 

        post_id, 

        content,

        1 AS word_number,

        LOWER(SUBSTRING_INDEX(content, ' ', 1)) AS word

    FROM 

        Posts

    UNION ALL

    SELECT 

        post_id, 

        content,

        word_number + 1,

        LOWER(SUBSTRING_INDEX(SUBSTRING_INDEX(content, ' ', word_number + 1), ' ', -1))

    FROM 

        words

    WHERE 

        word_number < (LENGTH(content) - LENGTH(REPLACE(content, ' ', '')) + 1)

)

SELECT 

    p.post_id,

    COALESCE(

        GROUP_CONCAT(DISTINCT k.topic_id ORDER BY k.topic_id),

        'Ambiguous!'

    ) AS topic

FROM 

    Posts p

LEFT JOIN 

    words w ON p.post_id = w.post_id

LEFT JOIN 

    Keywords k ON LOWER(k.word) = w.word

GROUP BY 

    p.post_id;

While this approach works, it's important to note that recursive CTEs can be performance-intensive for large datasets. In a big data environment, this might not be the best choice.

Using STRING_SPLIT (MySQL 8.0.17+)

For newer versions of MySQL, we can use STRING_SPLIT:

SELECT

    p.post_id,

    COALESCE(

        GROUP_CONCAT(DISTINCT k.topic_id ORDER BY k.topic_id),

        'Ambiguous!'

    ) AS topic

FROM

    Posts p

LEFT JOIN

    STRING_SPLIT(LOWER(p.content), ' ') words ON true

LEFT JOIN

    Keywords k ON LOWER(k.word) = words.value

GROUP BY

    p.post_id;

This is a clean and efficient solution, but it's only available in MySQL 8.0.17 and later. Be sure to also try out this approach in some practice environment so that you feel comfortable using this solution.

Key Takeaways

Remember, the best solution often depends on the specific context - database system, data volume, performance requirements, and more. By understanding these nuances, you'll be well-prepared for both coding interviews and real-world data challenges.