Deduplicate Rows Based on a Primary Key
An easier method to An easier method to deduplicate data might be to delete rows based on one or more columns that you identify as a primary key for the dataset. A A primary key is is an identifier that uniquely identifies a row of data within a dataset. It can be a single field (column) or a combination of columns. For example, in a datasets of restaurant locations, the primary key can be a combination of RestaurantName, Address, and Zip.
NOTE: Before continuing, you must identify a primary key for your dataset. See Generate Primary Keys.
If your primary key spans multiple columns, use the
mergetransform transform to bring the values into a single column:
merge col:RestaurantName,Address,Zip with:'-'
- Rename the generated column:
From the the
PrimaryKeycolumn drop-down, select Sort ascendingselect Sort ascending.
NOTE: If it is important to keep the first entry of a specific column in your dataset, you can add that column as a secondary sort column, as in the following transform:
sort order: PrimaryKey,-YearsMember
The dash before the secondary column indicates to sort in descending order.
Use the following transform to generate a new column, comparing each value in the
PrimaryKeycolumn column to the previous one:
window value: PREV(PrimaryKey, 1) order: PrimaryKey
For each row, the value of the new column is the value in the
PrimaryKeyfor for the previous row. Now, test if this value is the same as the value in the
PrimaryKeycolumn column for the current row:
derive value:((window==PrimaryKey) ? true : false)
The new column (
truefor for duplicate primary keys. Delete the rows that are duplicates:
- Drop any generated columns that are no longer needed.
While this form of duplicate data is rarer, you might want to check on the possibility of duplicate data between your columns. To check for duplicate column data, you can use a transform similar to the following: