Page tree

Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Published by Scroll Versions from space DEV and version next

...

In some cases, it might be acceptable to have duplicated data. For example, additional records using the same primary key might be included in a dataset as amendments or detail records. 

Info

NOTE: Before you remove duplicates from your dataset, you should verify that the data should not contain duplicates at all. If the data structure supports some duplicate elements including key values, you should exercise care in how you identify what constitutes duplicate information.

Deduplicate Transform

D s product
rtrue
 provides provides a single transform, which can remove identical rows from your dataset:

...

  • This transform is case-sensitive. So, if a column has values values Hello and  and HELLO, the rows containing those values are not considered duplicates and cannot be removed with this transform.
  • Whitespace and the beginning and ending of values is not ignored. 

Before applying the the deduplicate transform, you should attempt to normalize your data. You can use the following techniques to normalize a few columns of data.

Info

NOTE: If you have more than 20 columns of data, you might be better served by trying to identify a primary key method for de-duplicating your dataset. Details are below.

For individual columns, you can use the the trim function to remove leading and trailing whitespace:

...

D trans
Typestep
p01NameFormula type
p01ValueSingle row formula
p02NameFormula
p02ValueTRIM(Item)
SearchTermNew formula

Since the the deduplicate transform is case-sensitive, you can use the the LOWER function function to make the case of each entry in a column to be consistent:

D trans
Typestep
p01NameFormula type
p01ValueSingle row formula
p02NameFormula
p02ValueLOWER(Description)
SearchTermNew formula

For more information, see see Normalize Numeric Values . 

Deduplicate Rows Based on a Primary Key

 

An easier method to An easier method to deduplicate data might be to delete rows based on one or more columns that you identify as a primary key for the dataset.  A  A primary key  is key is an identifier that uniquely identifies a row of data within a dataset. It can be a single field (column) or a combination of columns. For example, in a datasets of restaurant locations, the primary key can be a combination of RestaurantName, Address, and Zip. 

 

Info

NOTE: Before continuing, you must identify a primary key for your dataset. See See Generate Primary Keys.

 

When you have identified your primary key, you should identify the appropriate method for your dataset. Please complete the following steps.

 

Steps:

 

  1. If your primary key spans multiple columns, use the the merge transform transform to bring the values into a single column:

    D trans
    Typestep
    p01NameColumns
    p01ValueRestaurantName,Address,Zip
    p02NameSeparator
    p02Value'-'
    SearchTermMerge columns

  2. Rename the generated column:   PrimaryKey.

  3. Use the following transform to generate a new column, comparing each value in the the PrimaryKey column column to the previous one:

    D trans
    Typestep
    p01NameFormulas
    p01ValuePREV(PrimaryKey, 1)
    p02NameOrder by
    p02ValuePrimaryKey
    SearchTermWindow

  4. For each row, the value of the new column is the value in the the PrimaryKey for for the previous row. Now, test if this value is the same as the value in the the PrimaryKey column column for the current row:

    D trans
    p03ValueIsDupe
    Typestep
    p01NameFormula type
    p01ValueSingle row formula
    p02NameFormula
    p02ValueIF((window==PrimaryKey),true,false)
    p03NameNew column name
    SearchTermNew formula

  5. The new column (IsDupe)  contains contains true for for duplicate primary keys. Delete the rows that are duplicates:

    D trans
    Typestep
    DescriptionCondition
    p01Name(IsDupe==true)
    SearchTermDelete rows

  6. Drop any generated columns that are no longer needed. 

Deduplicate Columns

While this form of duplicate data is rarer, you might want to check on the possibility of duplicate data between your columns. To check for duplicate column data, you can use a transform similar to the following:

...

In the generated column, values that are are true indicate duplicate data. If all values are are true, then you can remove one of the columns.