To remove duplicates while keeping the oldest row per group, which SQL approach is correct?

Prepare for the FAST Enterprises IC Interview. Enhance your skills with flashcards and multiple-choice questions. Each question provides hints and detailed explanations. Excel in your interview!

Multiple Choice

To remove duplicates while keeping the oldest row per group, which SQL approach is correct?

Explanation:
The idea is to assign a per-group rank to identify duplicates and keep the oldest one. By partitioning by the group columns and ordering by created_at in ascending order, the oldest row in each group gets a rank of 1. Deleting every row with a rank greater than 1 removes the duplicates while preserving the earliest entry in each group.

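In practice, you number the rows within each group and then delete every row whose number is greater than 1. For example, compute ROW_NUMBER() OVER (PARTITION BY group_columns ORDER BY created_at ASC) AS rn in a subquery or CTE, and then delete the rows where rn > 1. Because ROW_NUMBER() gives every row in a partition a distinct number, exactly one row per group survives: the oldest one. If two rows in a group can share the same created_at, add a unique tiebreaker such as the primary key to the ORDER BY so that the surviving row is deterministic.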
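The delete-by-row-number pattern can be sketched end to end with Python's built-in sqlite3 module. This is an illustration under assumptions, not a definitive implementation: the `signups` table, its columns, and the sample rows are hypothetical, the duplicate group is assumed to be (customer_id, email), and the bundled SQLite is assumed to be version 3.25 or later, which is when window functions were added. Other databases may need slightly different DELETE syntax.

```python
import sqlite3

# Hypothetical schema: (customer_id, email) defines a duplicate group,
# and id is a unique primary key used as a tiebreaker.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE signups (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL,
        email       TEXT    NOT NULL,
        created_at  TEXT    NOT NULL   -- ISO-8601 text sorts chronologically
    );
    INSERT INTO signups (id, customer_id, email, created_at) VALUES
        (1, 10, 'a@example.com', '2024-03-01'),
        (2, 10, 'a@example.com', '2024-01-15'),  -- oldest in its group
        (3, 10, 'a@example.com', '2024-02-20'),
        (4, 20, 'b@example.com', '2024-05-05'),  -- only row in its group
        (5, 30, 'c@example.com', '2024-04-01'),  -- tie on created_at ...
        (6, 30, 'c@example.com', '2024-04-01');  -- ... broken by id
""")

# Number the rows in each group from oldest to newest, then delete
# everything except row number 1.
conn.execute("""
    WITH ranked AS (
        SELECT id,
               ROW_NUMBER() OVER (
                   PARTITION BY customer_id, email
                   ORDER BY created_at ASC, id ASC
               ) AS rn
        FROM signups
    )
    DELETE FROM signups
    WHERE id IN (SELECT id FROM ranked WHERE rn > 1)
""")
conn.commit()

remaining = conn.execute("SELECT id FROM signups ORDER BY id").fetchall()
print(remaining)
```

Matching the rows to delete by primary key keeps the DELETE independent of how the ranking was computed, which is why the CTE selects only `id` and `rn`.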

The other options fall short. DISTINCT only affects a query’s result set and doesn’t remove rows from the table. GROUP BY with MAX(created_at) picks out the newest timestamp in each group rather than the oldest, and an aggregate query by itself deletes nothing. Joining with a non-deterministic random function would discard rows at random, with no guarantee that the oldest is kept.
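Two of these shortcomings are easy to check directly. The sketch below again uses Python's sqlite3 module with a hypothetical `events` table and made-up rows. It shows that a DISTINCT query leaves the table's row count unchanged, and that MAX(created_at) per group returns the latest timestamp.

```python
import sqlite3

# Hypothetical table: user_id is the duplicate group.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (
        id         INTEGER PRIMARY KEY,
        user_id    INTEGER NOT NULL,
        created_at TEXT    NOT NULL
    );
    INSERT INTO events (id, user_id, created_at) VALUES
        (1, 7, '2024-01-01'),
        (2, 7, '2024-06-01'),
        (3, 8, '2024-03-01');
""")

# DISTINCT collapses duplicates in the result set only.
distinct_users = conn.execute(
    "SELECT DISTINCT user_id FROM events ORDER BY user_id"
).fetchall()
rows_after_distinct = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]

# MAX(created_at) per group identifies the newest timestamp, not the oldest.
newest_per_user = conn.execute(
    "SELECT user_id, MAX(created_at) FROM events "
    "GROUP BY user_id ORDER BY user_id"
).fetchall()

print(distinct_users, rows_after_distinct, newest_per_user)
```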
