Page tree

Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Published by Scroll Versions from space DEV and version r0810

D toc

d-excerpt

The process of cleansing, enhancing, and transforming your data can introduce significant changes to it, some of which might not be intended. This page provides some tips and techniques for validating your dataset, from start to finish for your data wrangling efforts.

Data validation can be broken down into the following categories:

  • Consistency - Does your data fit into expected values for it? Do field values match the data type for the column? Are values within acceptable ranges? Are rows unique? Duplicated?
  • Completeness - Are all expected values included in your data? Are some fields missing values? Are there expected values that are not present in the dataset?

...

In the data quality bar at the top of a column, you can review the valid (green), mismatched (red), and missing (blackgray) values. 

When you click the red bar:

...

At the top of each column, the data quality bar includes a black gray bar indicating the number of cells in the column that do not contain values. This set of values includes missing values. 

Click the black gray bar to prompt for a set of suggestion cards for handling those values. 

...

  1. Create a reference dataset that contains a single instance of each item you are checking. In this example, it'd be a simple CSV file with the name of each state on a separate line. 

    Tip

    Tip: To your second dataset, you might want to add a second column containing the value true, which allows you to keep separate validation data from the columns that you join.

  2. Add this CSV file as a new dataset to your flow.
  3. Open your source dataset. In the Search panel, enter join datasets.
  4. In the Join window:
    1. Select the reference dataset you just created. Click Accept. Click Next.
    2. Select the type of join to perform:
      1. Right outer join: Select this join type if you want to delete rows in your source dataset that do not have a key value in the reference dataset. In the example, all rows that do not have a value in the State column would be removed from the generated dataset.
      2. Full outer join: Select this type to preserve all data, including the rows in the source that do not contain key values.
    3. Select the two fields that you want to use to join. In the example, you would select the two fields that identify state values. Click Next.
    4. Select the fields that you want to include in the final dataset. Click Review.
    5. Click Add to Recipe.
  5. The generated dataset includes all of the fields you specified.
  6. For one of your key values, click the black bar gray bar and select the link for the number of affected rows, which loads them into the data grid. Review the missing values in each key column.
  7. To remove these rows, select the missing value category in the data quality bar for the appropriate column and apply a delete statement.

  8. The generated command should look like the following:

    D trans
    Typestep
    p01NameCondition
    p01ValueISMISSING([State])
    SearchTermDelete rows

...