Page tree


Contents:

   

Contents:


When needed, you can validate a column of values against a pre-defined set of values maintained in a separate dataset. This method of data validation is most useful for String-based data that does not easily map to a specific pattern of values.

Overview 

This method of validation is completed through the following general steps:

  1. Prepare your validation dataset. Create a dataset containing the unique values against which you wish to validate. 
  2. Import your validation dataset. After you have prepared the dataset externally, you should import it into the Designer Cloud application .  
  3. Join your data to your validation dataset. You perform a join from the dataset you're wrangling to the validation dataset that you imported. Validation errors should be identifiable as missing values in the validation column.
  4. Triage defects as necessary. For rows that cannot be resolved, additional wrangling may be necessary.
  5. Standardize data. You can use the Standardize tool to review the differences between invalid data and valid data.  

Example

This approach is best demonstrated by example. Below, you can see a set of orders for product.

productNamecustomerNameQtytotalSales
Product ADACustomer ABC226
Product AEVCustomer DEF4100
Product DXLCustomer EFG642
Product EDMCustomer ABC126
Product JTOCustomer DEF375
Product JUBCustomer EFG535
Product NRSCustomer ABC626
product NSECustomer DEF8200
Product ZZZCustomer EFG1080

Notes:

  • You can see that this set of orders is spread across 10 different products for three different customers. 
  • In the productName column, there is a mismatch in capitalization. 
  • The final productName value (Product ZZZ) does not exist.

The product names in this list must be validated against a dataset containing the list of all available products. This list is 100 product names long. 

You can use the links below to download these datasets as CSV files for exploration in your project or workspace. 

Prepare Validation Dataset

If you haven't done so already, you should prepare your validation dataset for use in the  Designer Cloud application . Below, you can see the first 10 rows of the ProductNames dataset:

productName
Product ADA
Product AEV
Product ANH
Product ARA
Product ARM
Product AUJ
Product BAD
Product BAP
Product BEI
Product BEZ

Notes:

  • A column header is provided in the dataset. This is helpful for identifying the column to use later as the join key. 
  • You may wish to enter a validation column, simply contains the value TRUE

    Tip: If your dataset does not contain this column, you can create a new formula within the Transform Builder to insert this value. This step is covered later.

Import Validation Dataset

If you have prepared your dataset, you must import into in the application. 

Steps:

  1. In the  Designer Cloud application , click Library.
  2. Click Import Data.
  3. Navigate to the file or files to import. 

    Tip: If you are using the example files, you can right-click them above, download them to your desktop, and then drag and drop them into the Import Data page.

  4. Import the file as a new dataset.

    Tip: If you are using the example datasets, you can call it Reference-ProductNames.

  5. It may be helpful to import the file into a new file and create a recipe from it. 

For the example dataset, there is a single column of values. To make this dataset useful as a validation dataset, add the following transformation, which adds a second column called validation containing the value true for each row.

Transformation Name New formula
Parameter: Formula type Single row formula
Parameter: Formula 'true'
Parameter: New column name validation

The example reference dataset now looks like the following:

productNamevalidation
Product ADATRUE
Product AEVTRUE
Product ANHTRUE
Product ARATRUE
Product ARMTRUE
Product AUJTRUE
Product BADTRUE
Product BAPTRUE
Product BEITRUE
Product BEZTRUE

Join with Validation Dataset

Now you can join your existing dataset with the new validation dataset. A join performs a comparison of the column values in one dataset compared to the column values in another dataset. Where matches are detected, columns and values of the joined-in dataset are inserted into the source dataset. For more information on joins, see Join Types.

Steps:

  1. Edit the recipe of the source dataset. 
  2. As a new step for the recipe, enter join datasets in the Search panel.
  3. Select the source of the joined-in data:
    1. If you created a recipe and added steps (as in the example), then click the Recipes in current flow tab.
    2. If you imported a clean dataset, then click one of the Datasets tabs.
  4. For Join type, select Left.

    Tip: A left join includes all rows from the left (source) dataset and only the matching rows from the right dataset for a specified set of column values (join keys) in the left dataset. If a column value in the left dataset does not exist in the right dataset, then null values are listed for that row's entry for all columns imported from the right dataset.

  5. For the Join keys, select the column containing values to check from the left (source) dataset and then column containing the reference values in the right dataset. 

    Tip: In the example datasets, both of these columns are called ProductNames, which assists the join tool in identifying the join key columns.

  6. Under Join Keys, hover over one of the column names. Then, click the Pencil icon.

    Figure: Edit join keys

  7. When you edit the join keys, you can specify the Condition, which defines the type of comparison that is performed to determine a match. 

  8. The other options allow you to fine-tune how matching is performed. In particular, the Ignore case option is off, which means that by default, joins are case-sensitive. So, product 01 does not match to Product 01.  

    Tip: In the example data, you can see that the product NSE entry does not have a match, which is due to differences in case. If Ignore case is enabled, then this entry may find a match. However, you may wish to maintain case-sensitive searches to ensure that you can clean up the data correctly.

  9. Click Save and continue.
  10. Click Next.
  11. Select all columns, and click Review.

    Figure: Review join

    Tip: In the example, you can see that two rows failed to match.

  12. Click Add to recipe.

Triage Invalid Data

You should now have a dataset containing all columns from both datasets. 

Tip: In the example dataset, a second column called validation was added. This column contains null values for the mismatched rows. So, you can delete the duplicate ProductName column, which contains null values in two rows.

You can make some decisions on how to triage the invalid values in the ProductNames column.

Insert error messages

You can use a transformation to replace the null values in the validation column with a meaningful message:

Transformation Name Edit column with formula
Parameter: Columns validation
Parameter: Formula IF ($col == NULL(),'Error - invalid Product Name','ok')

The transformation replaces the null values with Error - invalid Product Name and writes ok for the other rows.

Delete invalid rows

If the entire row of data is invalid because of the invalid value, then you simply delete the row. This transformation deletes rows, where the validation column contains a null value:

Transformation Name Filter rows
Parameter: Condition Custom formula
Parameter: Type of formula Custom single
Parameter: Condition (validation == NULL())
Parameter: Action Delete matching rows

Update validation dataset with new values

If you discover that some of the mismatched rows should be part of your validation dataset, you can follow these general steps to add them.

Steps:

  1. In Flow View, add a new recipe off of the recipe where the join occurs. 
  2. Edit this new recipe. 
  3. Insert this transformation to filter the dataset down to the rows containing the new values:

    Transformation Name Filter rows
    Parameter: Condition Custom formula
    Parameter: Type of formula Custom single
    Parameter: Condition (validation == NULL())
    Parameter: Action Keep matching rows
  4. You can then delete all columns except (in the example) the ProductName column. From the column menu of the ProductName column, select Delete others.
  5. You should now have a single column containing the missing values. You can create a union between this dataset and the validation dataset to add the values. See Append Datasets.
  6. Run a job on this recipe to generate the output file. 
  7. This output file can then be used to replace the source data. In Flow View, select the imported dataset. From the context panel, open the More menu and select Replace.

Standardize invalid data

You can standardize values in join key column for your source data, which may address some of the invalid data. The Standardization tool attempts to cluster similar values together within the column, so that you can try to identify if your values in your source dataset can be matched to values in the target dataset.

Tip: After you have identified issues, this step may be best to apply before the join. By applying it before the join, some or all of the mismatched value issues may be addressed.

For your source column, you can select Standardize from the column menu. 

In the example, the ProductNames column values are being standardized. In this case, the lower case product NSE value has been corrected to be Product NSE

Figure: Standardize join key values

For more information, see Overview of Cluster Clean.

See Also for Validate Column Values against a Dataset:

This page has no comments.