How to Find Duplicates in SQL: A Step-by-Step Guide

By Cristian G. Guasch • Updated: 05/17/23 • 12 min read

Finding duplicates in SQL can be a tedious task, but it is an essential skill for anyone working with databases. Duplicates can cause errors, discrepancies, and inconsistencies in your data, leading to incorrect results and poor performance. Therefore, it is crucial to identify and remove duplicates from your tables to ensure data integrity and accuracy.

SQL provides several ways to find duplicates in your data, depending on your requirements and the structure of your tables. You can use the GROUP BY and HAVING clauses to group records by a particular column and filter out duplicates based on a count or condition. Alternatively, you can use the DISTINCT keyword to select only unique values and compare them with the original table to identify duplicates. There are also specialized functions and operators, such as COUNT(), EXISTS, and JOIN, that can help you find duplicates in more complex scenarios.
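As a quick first check before listing individual duplicates, the DISTINCT comparison mentioned above can be sketched as follows. This is a minimal illustration that uses Python's built-in sqlite3 module to run the SQL; the "users" table and its sample rows are assumptions made up for the example.

```python
import sqlite3

# Hypothetical in-memory table, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO users (email) VALUES (?)",
    [("a@example.com",), ("b@example.com",), ("a@example.com",)],
)

# COUNT(email) counts every non-NULL email, COUNT(DISTINCT email) counts
# each value once. If the two differ, some email appears more than once.
total, distinct = conn.execute(
    "SELECT COUNT(email), COUNT(DISTINCT email) FROM users"
).fetchone()
has_duplicates = total != distinct
print(total, distinct, has_duplicates)
```

This only tells you whether duplicates exist, not which values are duplicated, so it is best used as a cheap sanity check.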

In this article, we will explore various techniques to find duplicates in SQL, from simple queries to advanced methods, and provide examples and best practices to help you master this skill. Whether you are a beginner or an experienced SQL developer, this guide will help you improve your data quality and efficiency by detecting and eliminating duplicates in your tables.

Finding Duplicates Using GROUP BY and HAVING Clauses

One way to find duplicate values in SQL is by using the GROUP BY and HAVING clauses. These clauses allow you to group rows that have the same values in one or more columns and then filter the groups based on certain criteria. Here’s how it works:

  1. Start by selecting the columns you want to check for duplicates using the SELECT statement.
  2. Use the GROUP BY clause to group the rows by the selected columns.
  3. Use the COUNT function in the HAVING clause to filter the groups that have more than one row. These are the groups that contain duplicates.

For example, let’s say you have a table called “customers” with columns for “name” and “email”. You want to find all customers who have registered with the same email address. Here’s what the SQL query would look like:

SELECT email, COUNT(*) AS occurrences
FROM customers
GROUP BY email
HAVING COUNT(*) > 1;

This query groups the customers by their email addresses and then counts the number of customers in each group. The HAVING clause filters out groups that have only one customer, leaving only the groups with duplicate email addresses.
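To see the grouping in action, here is a small runnable sketch that uses Python's built-in sqlite3 module. The sample rows are invented for the example, and the query selects only the grouped column and the count so that it is valid in any database.

```python
import sqlite3

# Hypothetical sample data for the "customers" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [
        ("Ann", "ann@example.com"),
        ("Bob", "bob@example.com"),
        ("Annie", "ann@example.com"),
    ],
)

# Group by email and keep only the groups with more than one row.
duplicate_emails = conn.execute(
    """
    SELECT email, COUNT(*) AS occurrences
    FROM customers
    GROUP BY email
    HAVING COUNT(*) > 1
    """
).fetchall()
print(duplicate_emails)
```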

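It’s important to note that the GROUP BY clause must include every column you’re selecting that is not wrapped in an aggregate function like COUNT. Most databases reject a query that breaks this rule. MySQL with ONLY_FULL_GROUP_BY disabled and SQLite accept it, but they return the value from an arbitrary row in each group, which is rarely what you want.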

Another thing to keep in mind is that the COUNT function counts all rows in each group, not just the unique ones. So if you have multiple rows with the same name and email address, they will all be counted as duplicates.
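When a duplicate means that several columns match at once, every one of those columns goes into the GROUP BY clause. The sketch below assumes a hypothetical "customers" table with made-up rows and runs the SQL through Python's sqlite3 module.

```python
import sqlite3

# Hypothetical sample data: "Ann" is stored twice with the same email,
# while "Annie" shares the email but has a different name.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [
        ("Ann", "ann@example.com"),
        ("Ann", "ann@example.com"),
        ("Annie", "ann@example.com"),
        ("Bob", "bob@example.com"),
    ],
)

# Grouping by both columns reports only rows where name AND email repeat.
exact_duplicates = conn.execute(
    """
    SELECT name, email, COUNT(*) AS occurrences
    FROM customers
    GROUP BY name, email
    HAVING COUNT(*) > 1
    """
).fetchall()
print(exact_duplicates)
```

Grouping by email alone would instead report a count of 3 for ann@example.com, because "Annie" would fall into the same group.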

In addition to using the GROUP BY and HAVING clauses, you can also use other SQL statements like ORDER BY, WHERE, and JOIN to further refine your search for duplicates. You can also use unique constraints to prevent duplicate values from being inserted into a table in the first place.
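The unique-constraint idea can be sketched like this. The table definition and rows are assumptions for illustration, and the example uses SQLite through Python's sqlite3 module, where a violated constraint surfaces as sqlite3.IntegrityError; other databases report the violation with their own error types.

```python
import sqlite3

# Hypothetical table where the email column must be unique.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, email TEXT UNIQUE)")
conn.execute("INSERT INTO customers VALUES ('Ann', 'ann@example.com')")

# A second row with the same email violates the constraint and is rejected.
try:
    conn.execute("INSERT INTO customers VALUES ('Annie', 'ann@example.com')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
print(rejected, row_count)
```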

Overall, finding duplicates is a practical way to clean up messy data and to spot human errors or application bugs. By choosing the right search criteria and presenting the results clearly, you can quickly identify duplicate rows or values and take action to correct them.

Using COUNT Function

One way to find duplicates in SQL is by using the COUNT function. The COUNT function is an aggregate function that counts the number of rows in a table that meet a certain condition. By using the COUNT function, you can count the number of occurrences of a particular value in a column and identify duplicates.

To use the COUNT function to find duplicates, you will need to group the rows by the column that you want to check for duplicates. The GROUP BY clause is used to group the rows based on the values in a specific column. For example, if you want to find duplicates in the “email” column of a “users” table, you would group the rows by the “email” column.

Here is an example query that uses the COUNT function to find duplicates in the “email” column of a “users” table:

SELECT email, COUNT(email) as count
FROM users
GROUP BY email
HAVING COUNT(email) > 1;

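In this query, the GROUP BY clause groups the rows by the “email” column, and the COUNT function counts the number of occurrences of each email address. The HAVING clause filters the results to only show the email addresses that have more than one occurrence. Because COUNT(email) ignores NULL values, rows with a missing email are never reported as duplicates of each other; use COUNT(*) if they should be.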

The result of this query will be a table that shows the email addresses that have duplicates and the number of occurrences of each email address. You can use this information to identify and remove the duplicate rows from the table.
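The grouped result lists only the duplicated values, not the rows that carry them. One way to get the full rows back, sketched here with a hypothetical "users" table and Python's sqlite3 module, is to feed the grouped query into an IN subquery.

```python
import sqlite3

# Hypothetical sample data for the "users" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [
        (1, "Ann", "a@example.com"),
        (2, "Bob", "b@example.com"),
        (3, "Annie", "a@example.com"),
    ],
)

# The inner query finds duplicated emails; the outer query returns every
# row that has one of those emails.
duplicate_rows = conn.execute(
    """
    SELECT id, name, email
    FROM users
    WHERE email IN (
        SELECT email
        FROM users
        GROUP BY email
        HAVING COUNT(*) > 1
    )
    ORDER BY email, id
    """
).fetchall()
print(duplicate_rows)
```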

Using the COUNT function is a simple and effective way to find duplicates in SQL. It allows you to quickly identify the duplicate values in a column and take action to remove them from the table.

Using INNER JOIN Clause

One of the most common ways to find duplicates in SQL is by using the INNER JOIN clause. This clause allows you to combine two or more tables based on a common column, and return only the rows that have matching values in both tables.

To use the INNER JOIN clause to find duplicates, you will need to identify the columns that contain the duplicate data. Once you have identified these columns, you can use the INNER JOIN clause to join the table on these columns.

Here is an example of how to use the INNER JOIN clause to find duplicates in a table called “orders”:

SELECT o1.order_id, o2.order_id
FROM orders o1
INNER JOIN orders o2
ON o1.customer_id = o2.customer_id
AND o1.order_date = o2.order_date
AND o1.order_id <> o2.order_id;

In this example, the INNER JOIN clause joins the “orders” table to itself, using the “customer_id” and “order_date” columns as the join criteria. The “AND” operator combines the conditions, so a pair of rows is returned only when both columns match. The final condition, “AND o1.order_id <> o2.order_id”, keeps a row from being paired with itself.

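The result of this query will be a list of pairs of orders that have the same customer ID and order date, but different order IDs. Because the join matches in both directions, each pair appears twice, once as (A, B) and once as (B, A); replacing “<>” with “<” returns each pair only once. These are the duplicate orders that need to be investigated further.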
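The self-join can be tried out with the sketch below, which runs the query twice through Python's sqlite3 module: once with "<>" as in the article and once with "<". The sample orders are invented for the example.

```python
import sqlite3

# Hypothetical sample data: orders 1 and 2 share a customer and a date.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, order_date TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, 10, "2023-05-01"),
        (2, 10, "2023-05-01"),
        (3, 11, "2023-05-01"),
        (4, 10, "2023-05-02"),
    ],
)

# Same self-join, with only the comparison on order_id swapped in.
QUERY = """
SELECT o1.order_id, o2.order_id
FROM orders o1
INNER JOIN orders o2
  ON o1.customer_id = o2.customer_id
 AND o1.order_date = o2.order_date
 AND o1.order_id {op} o2.order_id
ORDER BY o1.order_id, o2.order_id
"""

mirrored_pairs = conn.execute(QUERY.format(op="<>")).fetchall()
unique_pairs = conn.execute(QUERY.format(op="<")).fetchall()
print(mirrored_pairs)
print(unique_pairs)
```

With "<>" the duplicate pair shows up in both directions, while "<" keeps only the version where the first order ID is the smaller one.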

Using the INNER JOIN clause in this way can be a powerful tool for finding duplicates in SQL. However, it is important to be careful when using this method, as it can be easy to accidentally return false positives or miss duplicates that are spread across multiple tables. It is always a good idea to double-check your results and use other methods, such as GROUP BY and HAVING clauses, to confirm your findings.

Finding Duplicates Using Common Table Expressions (CTE)

One of the ways to find duplicates in SQL is by using Common Table Expressions (CTE). A CTE is a temporary result set that is defined within the execution scope of a single SELECT, INSERT, UPDATE, DELETE, or CREATE VIEW statement.

To find duplicates using CTE, one can use the ROW_NUMBER() function, which assigns a unique sequential number to each row within a partition of a result set. The PARTITION BY clause specifies the columns used to define the partition, and the ORDER BY clause specifies the order of the rows within each partition.

Here’s an example of using CTE to find duplicates:

WITH CTE AS (
   SELECT column1, column2, column3,
          ROW_NUMBER() OVER (
              PARTITION BY column1, column2, column3
              ORDER BY column1, column2, column3
          ) AS RowNumber
   FROM table_name
)
SELECT *
FROM CTE
WHERE RowNumber > 1;

In this example, the CTE is defined with the columns to be checked for duplicates. The ROW_NUMBER() function is used to generate a sequence number for each row within a partition of the result set, where the partition is defined by the columns specified in the PARTITION BY clause. The result set is then filtered to only show rows where the RowNumber is greater than 1, indicating that there are duplicates.
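Here is the same pattern applied to concrete data, as a minimal sketch. The "contacts" table, its columns, and its rows are hypothetical, and the SQL is run through Python's sqlite3 module, which needs SQLite 3.25 or later for window functions.

```python
import sqlite3

# Hypothetical sample data: the first contact is stored three times.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (first_name TEXT, last_name TEXT, phone TEXT)")
conn.executemany(
    "INSERT INTO contacts VALUES (?, ?, ?)",
    [
        ("Ann", "Lee", "555-0100"),
        ("Ann", "Lee", "555-0100"),
        ("Bob", "Ray", "555-0101"),
        ("Ann", "Lee", "555-0100"),
    ],
)

# Number the rows inside each group of identical values, then keep
# everything after the first row of each group.
extra_copies = conn.execute(
    """
    WITH numbered AS (
        SELECT first_name, last_name, phone,
               ROW_NUMBER() OVER (
                   PARTITION BY first_name, last_name, phone
                   ORDER BY first_name, last_name, phone
               ) AS row_num
        FROM contacts
    )
    SELECT *
    FROM numbered
    WHERE row_num > 1
    ORDER BY row_num
    """
).fetchall()
print(extra_copies)
```

Three identical rows produce two results, numbered 2 and 3, because the row numbered 1 in each group is filtered out.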

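It’s important to note that the ORDER BY clause inside OVER() only controls how rows are numbered within each partition, and therefore which row receives number 1 and is treated as the original. It is independent of any ORDER BY in the outer SELECT statement. In the example above, the ordering columns are the same as the partitioning columns, so every row in a partition ties and the numbering among them is arbitrary. Order by a primary key or a timestamp instead if you need to control which copy counts as the original.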

This technique is not tied to a particular release. SQL Server has supported both CTEs and ROW_NUMBER() since SQL Server 2005, and PostgreSQL, Oracle, and MySQL 8.0 and later support them as well. The graph features introduced in SQL Server 2017 are a separate capability and are not needed to find duplicates.

Overall, using CTE to find duplicates in SQL can be a powerful tool for data analysis and management.

Finding Duplicates Using Window Functions

When it comes to finding duplicates in SQL, one of the most efficient ways is by using window functions. Window functions are a powerful tool that can be used to perform calculations across a set of rows that are related to the current row.

To find duplicates using window functions, you can use the ROW_NUMBER() function in combination with a PARTITION BY clause. The ROW_NUMBER() function assigns a unique number to each row within a partition, and the PARTITION BY clause groups rows into partitions based on a specific column or set of columns.

For example, suppose you have a table of users that includes columns for user_id, username, and email. To find all the users who have duplicate emails, you can use the following select statement:

SELECT user_id, username, email
FROM (
  SELECT user_id, username, email,
    ROW_NUMBER() OVER (PARTITION BY email ORDER BY user_id) AS row_num
  FROM users
) AS subquery
WHERE row_num > 1;

In this example, the PARTITION BY clause partitions the rows by the email column, and the ROW_NUMBER() function numbers the rows within each partition in order of user_id. The WHERE clause keeps only the rows with a row_num greater than 1. The row with the lowest user_id for each email gets row_num 1 and is treated as the original, so the query returns only the extra copies, not every row that shares an email.

It’s important to note that window functions can be used in combination with other SQL functions, such as CASE statements and aggregate functions, to create more complex queries. For example, you could use a CASE statement to group rows into a specific category, and then use a window function to find duplicates within that category.

In conclusion, using window functions to find duplicates in SQL can be a powerful and efficient method. By using the ROW_NUMBER() function in combination with a PARTITION BY clause, you can easily group rows into partitions and assign unique numbers to each row within the partition. This allows you to quickly identify and remove duplicate rows from your results.
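Once the extra copies are identified, removing them is a short step. The following sketch assumes a hypothetical "users" table and keeps the row with the lowest user_id for each email. It runs on SQLite (3.25 or later) through Python's sqlite3 module; other databases may need a slightly different DELETE syntax, so treat it as an illustration rather than a portable recipe.

```python
import sqlite3

# Hypothetical sample data: three users share the same email.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (user_id INTEGER PRIMARY KEY, username TEXT, email TEXT)"
)
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [
        (1, "ann", "ann@example.com"),
        (2, "bob", "bob@example.com"),
        (3, "annie", "ann@example.com"),
        (4, "ann2", "ann@example.com"),
    ],
)

# Delete every row whose number within its email partition is above 1,
# which keeps the lowest user_id for each email.
conn.execute(
    """
    DELETE FROM users
    WHERE user_id IN (
        SELECT user_id
        FROM (
            SELECT user_id,
                   ROW_NUMBER() OVER (PARTITION BY email ORDER BY user_id) AS row_num
            FROM users
        ) AS ranked
        WHERE row_num > 1
    )
    """
)

remaining = conn.execute(
    "SELECT user_id, username, email FROM users ORDER BY user_id"
).fetchall()
print(remaining)
```

On a real table it is worth running the inner SELECT on its own first and reviewing the rows it returns before deleting anything.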

Using ROW_NUMBER() Function with PARTITION BY Clause

When searching for duplicates in a SQL database, one useful tool is the ROW_NUMBER() function with the PARTITION BY clause. This function assigns a unique number to each row in a result set, based on the specified partitioning criteria.

The PARTITION BY clause allows the user to group the rows into partitions based on one or more columns. This means that the ROW_NUMBER() function will assign a unique number to each row within each partition, rather than across the entire result set.

For example, consider a table of customer orders with columns for order ID, customer ID, and order date. To find duplicate orders for each customer, one could use the following query:

SELECT order_id, customer_id, order_date, 
       ROW_NUMBER() OVER (PARTITION BY customer_id, order_date 
                          ORDER BY order_id) AS row_num
FROM orders

In this query, the ROW_NUMBER() function is used with the PARTITION BY clause to group the orders by customer and order date. The function then assigns a unique number to each row within each partition, based on the order ID.

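The resulting table will have an additional column called “row_num” that contains the assigned row numbers. Duplicates can then be identified by selecting rows with a row_num greater than 1. Because a window function cannot be referenced in the WHERE clause of the same SELECT, wrap the query in a subquery or CTE and apply the filter in the outer query.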
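Filtering on row_num takes one more step, because the numbering has to be computed before it can be tested. A minimal sketch, with invented rows for the "orders" table and Python's sqlite3 module (SQLite 3.25 or later) running the SQL:

```python
import sqlite3

# Hypothetical sample data: customer 10 has three orders on the same date.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, order_date TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, 10, "2023-05-01"),
        (2, 10, "2023-05-01"),
        (3, 11, "2023-05-01"),
        (4, 10, "2023-05-01"),
    ],
)

# The inner query numbers the rows; the outer query filters on that number.
duplicate_orders = conn.execute(
    """
    SELECT order_id, customer_id, order_date, row_num
    FROM (
        SELECT order_id, customer_id, order_date,
               ROW_NUMBER() OVER (PARTITION BY customer_id, order_date
                                  ORDER BY order_id) AS row_num
        FROM orders
    ) AS numbered
    WHERE row_num > 1
    ORDER BY order_id
    """
).fetchall()
print(duplicate_orders)
```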

Using the ROW_NUMBER() function with the PARTITION BY clause can be a powerful tool for finding duplicates in SQL databases. By grouping the rows into partitions based on specific criteria, the function can assign unique numbers to each row within each partition, making it easier to identify duplicates.

Using CASE Statement with ROW_NUMBER() Function

In SQL, the ROW_NUMBER() function is used to assign a unique sequential number to each row in a result set. This function can be used in conjunction with the CASE statement to identify duplicate records in a table.

The CASE expression applies conditional logic to each row in a result set. By testing the ROW_NUMBER() value inside CASE, each row can be labeled according to whether it is the first occurrence of its combination of values or a later repeat.

For example, the following SQL query can be used to find duplicate records in a “users” table based on the “first_name” and “last_name” columns:

SELECT *, 
CASE 
    WHEN ROW_NUMBER() OVER (PARTITION BY first_name, last_name ORDER BY id) > 1 
    THEN 'Duplicate' 
    ELSE 'Unique' 
END AS duplicate_status 
FROM users;

The query above partitions the result set by the “first_name” and “last_name” columns and orders the rows by the “id” column. The ROW_NUMBER() function assigns a unique sequential number to each row within each partition.

The CASE statement then checks if the ROW_NUMBER() value is greater than 1. If it is, then the row is identified as a duplicate. If not, the row is identified as unique.

The result set will include an additional column called “duplicate_status” that displays either “Duplicate” or “Unique” for each row.

id  first_name  last_name  email                   duplicate_status
1   John        Smith      john@example.com        Unique
2   Jane        Doe        jane@example.com        Unique
3   John        Smith      john.smith@example.com  Duplicate
4   Bob         Johnson    bob@example.com         Unique
5   Jane        Doe        jane.doe@example.com    Duplicate

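In the example above, rows 1, 2, and 4 are labeled “Unique”, while rows 3 and 5 are labeled “Duplicate” based on the “first_name” and “last_name” columns. Note that “Unique” here means the first occurrence within a group: rows 1 and 2 each have a later copy, and only row 4 has no match at all.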
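If every member of a repeated group should carry the "Duplicate" label, including the first occurrence, one possible variation swaps ROW_NUMBER() for COUNT(*) used as a window function. This is a different technique from the query above, shown here as a sketch with the same sample rows, run through Python's sqlite3 module (SQLite 3.25 or later).

```python
import sqlite3

# Sample rows matching the result table in the article.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT, email TEXT)"
)
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?, ?)",
    [
        (1, "John", "Smith", "john@example.com"),
        (2, "Jane", "Doe", "jane@example.com"),
        (3, "John", "Smith", "john.smith@example.com"),
        (4, "Bob", "Johnson", "bob@example.com"),
        (5, "Jane", "Doe", "jane.doe@example.com"),
    ],
)

# COUNT(*) OVER (PARTITION BY ...) gives each row the size of its group,
# so every row in a group of two or more is labeled 'Duplicate'.
labeled = conn.execute(
    """
    SELECT id,
           CASE
               WHEN COUNT(*) OVER (PARTITION BY first_name, last_name) > 1
               THEN 'Duplicate'
               ELSE 'Unique'
           END AS duplicate_status
    FROM users
    ORDER BY id
    """
).fetchall()
print(labeled)
```

Which variant fits depends on the goal: the ROW_NUMBER() version singles out the copies to delete, while this one highlights every row involved in a conflict.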

By utilizing the ROW_NUMBER() function within the CASE statement, duplicate records can be easily identified and managed in a SQL table.
