Page tree

Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Published by Scroll Versions from space DEV and version next

...

An easier method to deduplicate data might be to delete rows based on one or more columns that you identify as a primary key for the dataset. A primary key is an identifier that uniquely identifies a row of data within a dataset. It can be a single field (column) or a combination of columns. For example, in a datasets of restaurant locations, the primary key can be a combination of RestaurantName, Address, and Zip.

 

Info

NOTE: Before continuing, you must identify a primary key for your dataset. See Generate Primary Keys.

 

When you have identified your primary key, you should identify the appropriate method for your dataset. Please complete the following steps. 

Steps:

 

  1. If your primary key spans multiple columns, use the merge transform to bring the values into a single column:

    D trans
    Typestep
    p01NameColumns
    p01ValueRestaurantName,Address,Zip
    p02NameSeparator
    p02Value'-'
    SearchTermMerge columns

  2. Rename the generated column: PrimaryKey.

  3. Use the following transform to generate a new column, comparing each value in the PrimaryKey column to the previous one:

    D trans
    Typestep
    p01NameFormulas
    p01ValuePREV(PrimaryKey, 1)
    p02NameOrder by
    p02ValuePrimaryKey
    SearchTermWindow

  4. For each row, the value of the new column is the value in the PrimaryKey for the previous row. Now, test if this value is the same as the value in the PrimaryKey column for the current row:

    D trans
    p03ValueIsDupe
    Typestep
    p01NameFormula type
    p01ValueSingle row formula
    p02NameFormula
    p02ValueIF((window==PrimaryKey),true,false)
    p03NameNew column name
    SearchTermNew formula

  5. The new column (IsDupe) contains true for duplicate primary keys. Delete the rows that are duplicates:

    D trans
    Typestep
    DescriptionCondition
    p01Name(IsDupe==true)
    SearchTermDelete rows

  6. Drop Delete any generated columns that are no longer needed.

...