How to Remove Duplicate Records & Data in SQL? A Step-by-Step Guide with Examples

By Cristian G. Guasch • Updated: 05/17/23 • 9 min read

Removing duplicate records from SQL databases is a common task for database administrators. Duplicates can occur due to various reasons such as software bugs, user errors, or data migration issues. If left unattended, duplicates can cause data inconsistencies, slow down queries, and consume valuable storage space. Therefore, it is important to learn how to remove duplicates in SQL.

Related: How to Find Duplicates in SQL: A Step-by-Step Guide

There are several methods to remove duplicates in SQL, and the choice of method depends on various factors such as the size of the database, the number of duplicates, the complexity of the schema, and the performance requirements. Some of the common methods include using the DISTINCT keyword, using the GROUP BY clause, using the ROW_NUMBER function, using the DELETE statement with a self-join, and using Common Table Expressions (CTEs). Each method has its own advantages and disadvantages, and it is important to choose the right method for the specific scenario.

Understanding Duplicates in SQL

What are Duplicates?

Duplicates in SQL refer to records or rows in a database table that have identical values in one or more columns. For instance, if a table contains customer data, two or more records with the same name, address, and phone number would be considered duplicates.

Why are Duplicates a Problem?

Duplicates can cause several problems in a database, including:

  • Data inconsistency: Duplicate data can lead to inconsistencies in the database, making it difficult to maintain data quality and accuracy.
  • Increased storage space: Duplicate data occupies additional storage space, which can increase storage costs and slow down database performance.
  • Poor query performance: Queries that involve duplicates can take longer to execute, leading to slower query performance.
  • Data redundancy: Duplicate data can lead to redundant information, which can make it difficult to maintain data integrity.

To address these issues, it is essential to remove duplicates from the database. The next section will discuss how to remove duplicates in SQL.

Identifying Duplicate Rows

In SQL, duplicate rows are rows that share the same values in every column, or in the subset of columns that matters for the task at hand. They can cause problems when working with databases, so identifying and removing them is an important part of database management, and there are several ways to accomplish it.

Using SELECT DISTINCT

One way to identify duplicate values is the SELECT DISTINCT statement, which returns only the unique values in a specific column or set of columns. For example, to find all unique values in the “name” column of a table called “students,” one would use the following query:

SELECT DISTINCT name FROM students;

This query returns a list of all unique names in the “students” table. Note, however, that SELECT DISTINCT only filters duplicates out of the result set; it does not change the table itself. It can also be resource-intensive on large tables, since the engine must sort or hash every row to find the unique ones.
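As a quick sanity check, the same query can be run end-to-end with Python’s built-in sqlite3 module. The four-row “students” table and its names below are invented purely for illustration:

```python
import sqlite3

# In-memory database with a hypothetical "students" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT)")
conn.executemany("INSERT INTO students (name) VALUES (?)",
                 [("Alice",), ("Bob",), ("Alice",), ("Carol",)])

# SELECT DISTINCT collapses the repeated 'Alice' in the result set,
# but the table itself still holds four rows.
rows = conn.execute("SELECT DISTINCT name FROM students ORDER BY name").fetchall()
print([r[0] for r in rows])  # → ['Alice', 'Bob', 'Carol']
```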

Using GROUP BY

Another way to identify and remove duplicate rows is by using the GROUP BY clause. This clause groups rows that have the same values in a specific column or set of columns. For example, to group rows in a table called “sales” by the “product” column and find the total sales for each product, one would use the following query:

SELECT product, SUM(sales) FROM sales GROUP BY product;

This query would return a table with each unique product and its total sales. Many engines execute GROUP BY and SELECT DISTINCT with the same query plan, so neither is inherently faster; GROUP BY, however, lets you aggregate while you deduplicate, at the cost of requiring more specific knowledge of the table’s structure.

Using ROW_NUMBER() Function

Finally, one can use the ROW_NUMBER() function to identify and remove duplicate rows. This function assigns a unique number to each row in a table, which can then be used to filter out duplicates. For example, to remove duplicate rows from a table called “employees” based on the “name” column, one would use the following query:

WITH CTE AS (
    SELECT name, ROW_NUMBER() OVER(PARTITION BY name ORDER BY name) AS rn
    FROM employees
)
DELETE FROM CTE WHERE rn > 1;

This query creates a common table expression (CTE) that numbers the rows of the “employees” table within each group of identical names. The DELETE then removes every row whose number is greater than 1, leaving one copy of each name. Two caveats: the ORDER BY inside OVER() is arbitrary here, since every row in a partition shares the same name, so order by an id or timestamp column instead if you care which copy survives; and deleting through a CTE is SQL Server syntax, so other databases need a different formulation.
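For databases without the CTE-delete shortcut, the same idea can be expressed by targeting row identifiers. A runnable sketch in SQLite (via Python’s sqlite3 module, SQLite 3.25+ for window functions), using SQLite’s hidden rowid; the “employees” data is invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT)")
conn.executemany("INSERT INTO employees (name) VALUES (?)",
                 [("Ann",), ("Ann",), ("Ben",), ("Ben",), ("Ben",)])

# Number the rows within each name group, then delete every rowid
# whose number is greater than 1, keeping one copy per name.
conn.execute("""
    DELETE FROM employees
    WHERE rowid IN (
        SELECT rowid FROM (
            SELECT rowid, ROW_NUMBER() OVER (PARTITION BY name ORDER BY name) AS rn
            FROM employees
        ) WHERE rn > 1
    )
""")
remaining = conn.execute("SELECT COUNT(*) FROM employees").fetchone()[0]
print(remaining)  # → 2
```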

In conclusion, identifying and removing duplicate rows in SQL is an important task in database management. There are several ways to accomplish this, including using SELECT DISTINCT, GROUP BY, and the ROW_NUMBER() function. Each method has its advantages and disadvantages, and the choice of method depends on the specific requirements of the task at hand.

Removing Duplicate Rows

Removing duplicate rows in SQL is a common task that database administrators and developers perform. There are several ways to remove duplicate rows, including using the DELETE statement, using a Common Table Expression (CTE), and creating a new table with unique values. In this section, we will explore these methods in detail.

Using DELETE Statement

One way to remove duplicate rows is with the DELETE statement. A subquery that uses GROUP BY with an aggregate such as MIN picks one row to keep from each group of duplicates, and the DELETE removes the rest. Here is an example:

DELETE FROM sample_table
WHERE id NOT IN (SELECT MIN(id)
                 FROM sample_table
                 GROUP BY column1, column2, column3);

In this example, MIN(id) selects the lowest id within each group of rows that share the same values in column1, column2, and column3. The NOT IN operator then restricts the DELETE to every other row, so exactly one copy of each duplicate set survives.
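The pattern can be tried end-to-end with Python’s sqlite3 module; the table and column names mirror the snippet above, and the data is invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE sample_table
                (id INTEGER PRIMARY KEY, column1 TEXT, column2 TEXT, column3 TEXT)""")
conn.executemany(
    "INSERT INTO sample_table (column1, column2, column3) VALUES (?, ?, ?)",
    [("a", "b", "c"), ("a", "b", "c"), ("x", "y", "z")])

# Keep only the lowest id within each group of identical rows.
conn.execute("""
    DELETE FROM sample_table
    WHERE id NOT IN (SELECT MIN(id)
                     FROM sample_table
                     GROUP BY column1, column2, column3)
""")
kept = conn.execute("SELECT COUNT(*) FROM sample_table").fetchone()[0]
print(kept)  # → 2
```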

Using Common Table Expression (CTE)

Another way to remove duplicate rows is by using a Common Table Expression (CTE). This method is available in SQL Server 2005 and later versions. Here is an example:

WITH CTE AS (
    SELECT column1, column2, column3, ROW_NUMBER() OVER (PARTITION BY column1, column2, column3 ORDER BY id) AS RN
    FROM sample_table
)
DELETE FROM CTE
WHERE RN > 1;

In this example, the ROW_NUMBER() function assigns a sequence number to each row within its group: PARTITION BY groups the rows by the specified columns, and ORDER BY id decides which row in each group gets number 1. Because the CTE is a direct projection of sample_table, deleting from the CTE removes the corresponding rows from the underlying table (SQL Server behavior), leaving one row per group.

Creating a New Table with Unique Values

Finally, you can create a new table with unique values by using the DISTINCT keyword or the RANK() function. Here is an example:

SELECT DISTINCT column1, column2, column3
INTO new_table
FROM sample_table;

In this example, the DISTINCT keyword selects the unique rows based on the specified columns, and the INTO clause creates a new table holding them. Note that SELECT … INTO is SQL Server syntax; most other databases use CREATE TABLE … AS SELECT instead. Alternatively, a ranking function such as RANK() or ROW_NUMBER() can number the rows within each group, after which you select only the rows ranked 1 into the new table.
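A small sketch of the portable CREATE TABLE … AS SELECT DISTINCT form, run through Python’s sqlite3 module with invented data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sample_table (column1 TEXT, column2 TEXT, column3 TEXT)")
conn.executemany("INSERT INTO sample_table VALUES (?, ?, ?)",
                 [("a", "b", "c")] * 3 + [("x", "y", "z")])

# Materialize only the distinct rows into a fresh table; the original
# table (and its duplicates) is left untouched.
conn.execute("""
    CREATE TABLE new_table AS
    SELECT DISTINCT column1, column2, column3 FROM sample_table
""")
n = conn.execute("SELECT COUNT(*) FROM new_table").fetchone()[0]
print(n)  # → 2
```

After verifying new_table, you would typically drop the old table and rename the new one to take its place.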

In conclusion, there are several methods to remove duplicate rows in SQL, including using the DELETE statement, using a Common Table Expression (CTE), and creating a new table with unique values. Each method has its advantages and disadvantages, and the choice depends on the specific requirements of the task at hand.

Best Practices for Removing Duplicates

Removing duplicates from a SQL table is a common task that requires careful consideration to maintain data integrity and optimize performance. Here are some best practices to follow when removing duplicates from a production table.

Designing Tables with Relevant Keys

Designing tables with relevant keys is crucial to prevent duplicates from being inserted in the first place. Primary keys, identity columns, and clustered indexes can help ensure that each row is unique.

For example, a table containing customer information might have a primary key based on a unique identifier such as a customer ID. This ensures that each customer is entered only once into the table. Similarly, an identity column can be used to automatically generate a unique value for each row.

Using Constraints for Data Integrity

Constraints can be used to enforce data integrity and prevent duplicates from being inserted. For example, a unique constraint can be added to a specific column to ensure that no two rows have the same value in that column.

Constraints can also be used to enforce referential integrity between related tables. This ensures that data in one table is consistent with data in another table.
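A short sketch of a unique constraint doing its job, using Python’s sqlite3 module and a hypothetical customers table with a UNIQUE email column; the second insert of the same email is rejected by the database itself:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE customers
                (customer_id INTEGER PRIMARY KEY, email TEXT UNIQUE)""")
conn.execute("INSERT INTO customers (email) VALUES ('a@example.com')")

# Attempting to insert a duplicate email violates the UNIQUE constraint.
try:
    conn.execute("INSERT INTO customers (email) VALUES ('a@example.com')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)  # → True
```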

Optimizing Performance

Removing duplicates from a large table can be a time-consuming process. To optimize performance, it is important to use efficient queries and minimize the number of table scans.

One way to remove duplicates efficiently is to use the ROW_NUMBER() function in a CTE to identify duplicate rows in a single pass over the table. For example, to remove duplicates from an employee table based on first name and last name, the following query can be used:

WITH CTE AS (
    SELECT first_name, last_name, ROW_NUMBER() OVER (PARTITION BY first_name, last_name ORDER BY first_name) AS RN
    FROM employees
)
DELETE FROM CTE WHERE RN > 1;

A HAVING clause can also narrow the work to just the groups that actually contain duplicates. For example, the following query lists one representative value (the MAX of column_name) for each value of specific_column that appears more than once; it identifies the duplicates but does not delete them, so its result would typically feed a follow-up DELETE:

SELECT MAX(column_name)
FROM production_table
GROUP BY specific_column
HAVING COUNT(*) > 1;
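Run against a small invented dataset with Python’s sqlite3 module, the HAVING pattern surfaces only the groups that contain duplicates:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE production_table (specific_column TEXT)")
conn.executemany("INSERT INTO production_table VALUES (?)",
                 [("dup",), ("dup",), ("solo",)])

# Only groups with more than one row survive the HAVING filter.
dupes = conn.execute("""
    SELECT specific_column, COUNT(*)
    FROM production_table
    GROUP BY specific_column
    HAVING COUNT(*) > 1
""").fetchall()
print(dupes)  # → [('dup', 2)]
```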

Examples

Here are a few examples of how to remove duplicates from a table using the best practices discussed above:

  • Removing duplicates from a table using an SSIS package
  • Removing duplicates from a table using a DELETE statement with a subquery
  • Removing duplicates from a table using a SELECT statement with a HAVING clause

By following these best practices, you can ensure that duplicates are removed efficiently and that data integrity is maintained.
