Page tree

 

Support | BlogContact Us | 844.332.2821

Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Published by Scroll Versions from space DEV and version next

...

Deduplicate Rows Based on a Primary Key

An easier method to An easier method to deduplicate data might be to delete rows based on one or more columns that you identify as a primary key for the dataset.   A  primary key is  is an identifier that uniquely identifies a row of data within a dataset. It can be a single field (column) or a combination of columns. For example, in a datasets of restaurant locations, the primary key can be a combination of RestaurantName, Address, and Zip. 

Info

NOTE: Before continuing, you must identify a primary key for your dataset. See Generate Primary Keys.

...

  1. If your primary key spans multiple columns, use the merge transform  transform to bring the values into a single column:

    D code

    merge col:RestaurantName,Address,Zip with:'-'

  2. Rename the generated column: PrimaryKey.
  3. From the the PrimaryKey column drop-down, select Sort ascendingselect Sort ascending.

    Info

    NOTE: If it is important to keep the first entry of a specific column in your dataset, you can add that column as a secondary sort column, as in the following transform:

    D code

    sort order: PrimaryKey,-YearsMember

    The dash before the secondary column indicates to sort in descending order.

  4. Use the following transform to generate a new column, comparing each value in the PrimaryKey column  column to the previous one:

    D code

    window value: PREV(PrimaryKey, 1) order: PrimaryKey

  5. For each row, the value of the new column is the value in the PrimaryKey for  for the previous row. Now, test if this value is the same as the value in the PrimaryKey column  column for the current row:

    D code

    derive value:((window==PrimaryKey) ? true : false)

  6. The new column (IsDupe) contains true for  for duplicate primary keys. Delete the rows that are duplicates:

    D code

    delete row:(IsDupe==true)

  7. Drop any generated columns that are no longer needed. 

Deduplicate Columns

While this form of duplicate data is rarer, you might want to check on the possibility of duplicate data between your columns. To check for duplicate column data, you can use a transform similar to the following:

...