Mastering SQL: LeetCode 2199 - Finding the Topic of Each Post
In today's data-driven world, the ability to extract meaningful information from text data is crucial. LeetCode's problem 2199, "Finding the Topic of Each Post", presents an excellent opportunity to hone your SQL skills in this area. Let's dive into this problem and explore different approaches to solve it.
The Problem
We're given two tables: Keywords and Posts. The task is to find the topics of each post based on the keywords present in its content. Here are the key points:
Keywords are case-insensitive.
A post can have multiple topics.
If a post has no matching keywords, its topic should be "Ambiguous!".
For posts with topics, we need to return a comma-separated list of topic IDs in ascending order.
The Solution
Let's break down the solution into steps:
Split the content of each post into individual words.
Match these words against the keywords (case-insensitive).
Group the results by post_id.
Aggregate the matching topic_ids.
Here's a PostgreSQL solution using unnest, also pay special attention to the use of the ',' and the ' ':
SELECT
p.post_id,
COALESCE(
STRING_AGG(DISTINCT CAST(k.topic_id AS VARCHAR), ',' ORDER BY CAST(k.topic_id AS VARCHAR)),
'Ambiguous!'
) AS topic
FROM
Posts p
LEFT JOIN LATERAL
unnest(string_to_array(lower(p.content), ' ')) AS words(word) ON true
LEFT JOIN
Keywords k ON lower(k.word) = words.word
GROUP BY
p.post_id;
This solution uses unnest and string_to_array to split the content into words. It's efficient and readable, making it a good choice for PostgreSQL environments.
Alternative Approaches
Using RECURSIVE CTE
For versions of databases that don't support unnest or string_to_array, we can use a RECURSIVE CTE, however emphasize here on the performance hit that you take withthis approach:
WITH RECURSIVE
words AS (
SELECT
post_id,
content,
1 AS word_number,
LOWER(SUBSTRING_INDEX(content, ' ', 1)) AS word
FROM
Posts
UNION ALL
SELECT
post_id,
content,
word_number + 1,
LOWER(SUBSTRING_INDEX(SUBSTRING_INDEX(content, ' ', word_number + 1), ' ', -1))
FROM
words
WHERE
word_number < (LENGTH(content) - LENGTH(REPLACE(content, ' ', '')) + 1)
)
SELECT
p.post_id,
COALESCE(
GROUP_CONCAT(DISTINCT k.topic_id ORDER BY k.topic_id),
'Ambiguous!'
) AS topic
FROM
Posts p
LEFT JOIN
words w ON p.post_id = w.post_id
LEFT JOIN
Keywords k ON LOWER(k.word) = w.word
GROUP BY
p.post_id;
While this approach works, it's important to note that recursive CTEs can be performance-intensive for large datasets. In a big data environment, this might not be the best choice.
Using STRING_SPLIT (MySQL 8.0.17+)
For newer versions of MySQL, we can use STRING_SPLIT:
SELECT
p.post_id,
COALESCE(
GROUP_CONCAT(DISTINCT k.topic_id ORDER BY k.topic_id),
'Ambiguous!'
) AS topic
FROM
Posts p
LEFT JOIN
STRING_SPLIT(LOWER(p.content), ' ') words ON true
LEFT JOIN
Keywords k ON LOWER(k.word) = words.value
GROUP BY
p.post_id;
This is a clean and efficient solution, but it's only available in MySQL 8.0.17 and later. Be sure to also try out this approach in some practice environment so that you feel comfortable using this solution.
Key Takeaways
Always be aware of the specific SQL engine and version you're working with. Different databases have different functions and optimizations, be sure to check this before an interview or otherwise ask this as a separate question before your interview.
Functions like unnest in PostgreSQL can greatly simplify text processing tasks. Familiarize yourself with such functions for your preferred database.
While recursive CTEs are powerful, be cautious about mentioning them in interviews without caveats. They can be problematic in big data scenarios.
Always have multiple approaches in your toolkit. If one method isn't available or performant, you should be ready with alternatives.
Consider mentioning performance implications of different approaches during interviews. It shows depth of understanding.
Remember, the best solution often depends on the specific context - database system, data volume, performance requirements, and more. By understanding these nuances, you'll be well-prepared for both coding interviews and real-world data challenges.