Page tree

Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Published by Scroll Versions from space DEV and version next

...

Tip

Tip: If you are attempting to identify if there are duplicate rows, check the row count in your dataset before and after you have added this transform.

d-

...

trans
Typestep
SearchTermRemove duplicate rows

Limitations:

  • This transform is case-sensitive. So, if a column has values Hello and HELLO, the rows containing those values are not considered duplicates and cannot be removed with this transform.
  • Whitespace and the beginning and ending of values is not ignored. 

...

Info

NOTE: To preserve the original column values, use the derive transform. The set transform replaces the original values.

d-

...

trans
Type

...

step

...

p01Name

...

Formula type
p01ValueSingle row formula
p02NameFormula
p02ValueTRIM(Item)
SearchTermNew formula

Since the deduplicate transform is case-sensitive, you can use the LOWER function to make the case of each entry in a column to be consistent:

d-

...

trans
Typestep
p01NameFormula type
p01ValueSingle row formula
p02NameFormula
p02ValueLOWER(Description)
SearchTermNew formula

For more information, see  Normalize Numeric Values. 

Deduplicate Rows Based on a Primary Key

...

  1. If your primary key spans multiple columns, use the merge transform to bring the values into a single column:

    d-

    codemerge col:

    trans
    Typestep
    p01NameColumns
    p01ValueRestaurantName,Address,Zip

    with:

    p02NameSeparator
    p02Value'-'
    SearchTermMerge columns

  2. Rename the generated column: PrimaryKey.

  3. Use the following transform to generate a new column, comparing each value in the PrimaryKey column to the previous one:

    d-

    codewindow value:

    trans
    Typestep
    p01NameFormulas
    p01ValuePREV(PrimaryKey, 1)

    order: PrimaryKey

    p02NameOrder by
    p02ValuePrimaryKey
    SearchTermWindow

  4. For each row, the value of the new column is the value in the PrimaryKey for the previous row. Now, test if this value is the same as the value in the PrimaryKey column for the current row:

    d-

    codederive type:single value:

    trans
    p03ValueIsDupe
    Typestep
    p01NameFormula type
    p01ValueSingle row formula
    p02NameFormula
    p02ValueIF((window==PrimaryKey)

    ?

    ,true

    :

    ,false)
    p03NameNew column name
    SearchTermNew formula

  5. The new column (IsDupe) contains true for duplicate primary keys. Delete the rows that are duplicates:

    d-

    codedelete row:

    trans
    Typestep
    DescriptionCondition
    p01Name(IsDupe==true)
    SearchTermDelete rows

  6. Drop any generated columns that are no longer needed. 

...

While this form of duplicate data is rarer, you might want to check on the possibility of duplicate data between your columns. To check for duplicate column data, you can use a transform similar to the following:

d-

...

trans
p03Value'dupeColVals'
Typestep
p01NameFormula type
p01ValueSingle row formula
p02NameFormula
p02ValueColumn1 == Column2
p03NameNew column name
SearchTermNew formula

In the generated column, values that are true indicate duplicate data. If all values are true, then you can remove one of the columns.