NOTE: Before you remove duplicates from your dataset, you should verify that the data should not contain duplicates at all. If the data structure supports some duplicate elements including key values, you should exercise care in how you identify what constitutes duplicate information.
Remove duplicate rows transformation
|D s product|
Tip: If you are attempting to identify if there are duplicate rows, check the row count in your dataset before and after you have added this transformtransformation.
- This transform transformation is case-sensitive. So, if a column has values
HELLO, the rows containing those values are not considered duplicates and cannot be removed with this transformtransformation.
- Whitespace and the beginning and ending of values is not ignored.
Before applying the
Remove deduplicate transform
rows transformation, you should attempt to normalize your data. You can use the following techniques to normalize a few columns of data.
NOTE: To preserve the original column values, use the
Remove deduplicate transform
rows transformation is case-sensitive, you can use the
LOWER function to make the case of each entry in a column to be consistent:
Deduplicate Rows Based on a Primary Key
An easier Another method to deduplicate data might be to delete rows based on one or more columns that you identify as a primary key for the dataset. A primary key is an identifier that uniquely identifies a row of data within a dataset. It can be a single field (column) or a combination of columns. For example, in a datasets of restaurant locations, the primary key can be a combination of RestaurantName, Address, and Zip.
If your primary key spans multiple columns, use the
Merge columnstransformation to bring the values into a single column:
D trans Type step p01Name Columns p01Value RestaurantName,Address,Zip p02Name Separator p02Value '-' SearchTerm Merge columns
Rename the generated column:
Use the following transform transformation to generate a new column, comparing each value in the
PrimaryKeycolumn to the previous one:
D trans Type step p01Name Formulas p01Value PREV(PrimaryKey, 1) p02Name Order by p02Value PrimaryKey SearchTerm Window
For each row, the value of the new column is the value in the
PrimaryKeyfor the previous row. Now, test if this value is the same as the value in the
PrimaryKeycolumn for the current row:
D trans p03Value IsDupe Type step p01Name Formula type p01Value Single row formula p02Name Formula p02Value IF((window==PrimaryKey),true,false) p03Name New column name SearchTerm New formula
The new column (
truefor duplicate primary keys. Delete the rows that are duplicates:
D trans Type step Description Condition p01Name (IsDupe==true) SearchTerm Delete rows
- Delete any generated columns that are no longer needed.
While this form of duplicate data is rarer, you might want to check on the possibility of duplicate data between your columns. To check for duplicate column data, you can use a transform transformation similar to the following: