The process of cleansing, enhancing, and transforming your data can introduce significant changes to it, some of which might not be intended. This page provides some tips and techniques for validating your dataset, from start to finish for your data wrangling efforts.
Data validation can be broken down into the following categories:
In the data quality bar at the top of a column, you can review the valid (green), mismatched (red), and missing (blackgray) values.
When you click the red bar:
At the top of each column, the data quality bar includes a black gray bar indicating the number of cells in the column that do not contain values. This set of values includes missing values.
Click the black gray bar to prompt for a set of suggestion cards for handling those values.
Create a reference dataset that contains a single instance of each item you are checking. In this example, it'd be a simple CSV file with the name of each state on a separate line.
Tip: To your second dataset, you might want to add a second column containing the value
true, which allows you to keep separate validation data from the columns that you join.
- Add this CSV file as a new dataset to your flow.
- Open your source dataset. In the Search panel, enter
- In the Join window:
- Select the reference dataset you just created. Click Accept. Click Next.
- Select the type of join to perform:
- Right outer join: Select this join type if you want to delete rows in your source dataset that do not have a key value in the reference dataset. In the example, all rows that do not have a value in the State column would be removed from the generated dataset.
- Full outer join: Select this type to preserve all data, including the rows in the source that do not contain key values.
- Select the two fields that you want to use to join. In the example, you would select the two fields that identify state values. Click Next.
- Select the fields that you want to include in the final dataset. Click Review.
- Click Add to Recipe.
- The generated dataset includes all of the fields you specified.
- For one of your key values, click the black bar gray bar and select the link for the number of affected rows, which loads them into the data grid. Review the missing values in each key column.
To remove these rows, select the missing value category in the data quality bar for the appropriate column and apply a delete statement.
The generated command should look like the following:
D trans Type step p01Name Condition p01Value ISMISSING([State]) SearchTerm Delete rows