Mastering SQL: LeetCode 2199 - Finding the Topic of Each Post

In today's data-driven world, the ability to extract meaningful information from text data is crucial. LeetCode's problem 2199, "Finding the Topic of Each Post", presents an excellent opportunity to hone your SQL skills in this area. Let's dive into this problem and explore different approaches to solve it.

The Problem

We're given two tables: Keywords and Posts. The task is to find the topics of each post based on the keywords present in its content. Here are the key points:

Keywords are case-insensitive.
A post can have multiple topics.
If a post has no matching keywords, its topic should be "Ambiguous!".
For posts with topics, we need to return a comma-separated list of topic IDs in ascending order.

The Solution

Let's break down the solution into steps:

Split the content of each post into individual words.
Match these words against the keywords (case-insensitive).
Group the results by post_id.
Aggregate the matching topic_ids.

Here's a PostgreSQL solution using unnest, also pay special attention to the use of the ',' and the ' ':

SELECT

p.post_id,

COALESCE(

STRING_AGG(DISTINCT CAST(k.topic_id AS VARCHAR), ',' ORDER BY CAST(k.topic_id AS VARCHAR)),

'Ambiguous!'

) AS topic

FROM

Posts p

LEFT JOIN LATERAL

unnest(string_to_array(lower(p.content), ' ')) AS words(word) ON true

LEFT JOIN

Keywords k ON lower(k.word) = words.word

GROUP BY

p.post_id;

This solution uses unnest and string_to_array to split the content into words. It's efficient and readable, making it a good choice for PostgreSQL environments.

Alternative Approaches

Using RECURSIVE CTE

For versions of databases that don't support unnest or string_to_array, we can use a RECURSIVE CTE, however emphasize here on the performance hit that you take withthis approach:

WITH RECURSIVE

words AS (

SELECT

post_id,

content,

1 AS word_number,

LOWER(SUBSTRING_INDEX(content, ' ', 1)) AS word

FROM

Posts

UNION ALL

SELECT

post_id,

content,

word_number + 1,

LOWER(SUBSTRING_INDEX(SUBSTRING_INDEX(content, ' ', word_number + 1), ' ', -1))

FROM

words

WHERE

word_number < (LENGTH(content) - LENGTH(REPLACE(content, ' ', '')) + 1)

)

SELECT

p.post_id,

COALESCE(

GROUP_CONCAT(DISTINCT k.topic_id ORDER BY k.topic_id),

'Ambiguous!'

) AS topic

FROM

Posts p

LEFT JOIN

words w ON p.post_id = w.post_id

LEFT JOIN

Keywords k ON LOWER(k.word) = w.word

GROUP BY

p.post_id;

While this approach works, it's important to note that recursive CTEs can be performance-intensive for large datasets. In a big data environment, this might not be the best choice.

Using STRING_SPLIT (MySQL 8.0.17+)

For newer versions of MySQL, we can use STRING_SPLIT:

SELECT

p.post_id,

COALESCE(

GROUP_CONCAT(DISTINCT k.topic_id ORDER BY k.topic_id),

'Ambiguous!'

) AS topic

FROM

Posts p

LEFT JOIN

STRING_SPLIT(LOWER(p.content), ' ') words ON true

LEFT JOIN

Keywords k ON LOWER(k.word) = words.value

GROUP BY

p.post_id;

This is a clean and efficient solution, but it's only available in MySQL 8.0.17 and later. Be sure to also try out this approach in some practice environment so that you feel comfortable using this solution.

Key Takeaways

Always be aware of the specific SQL engine and version you're working with. Different databases have different functions and optimizations, be sure to check this before an interview or otherwise ask this as a separate question before your interview.
Functions like unnest in PostgreSQL can greatly simplify text processing tasks. Familiarize yourself with such functions for your preferred database.
While recursive CTEs are powerful, be cautious about mentioning them in interviews without caveats. They can be problematic in big data scenarios.
Always have multiple approaches in your toolkit. If one method isn't available or performant, you should be ready with alternatives.
Consider mentioning performance implications of different approaches during interviews. It shows depth of understanding.

Remember, the best solution often depends on the specific context - database system, data volume, performance requirements, and more. By understanding these nuances, you'll be well-prepared for both coding interviews and real-world data challenges.