Page tree

Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Published by Scroll Versions from space DEV and version r097
d-docdevnotes

D migrated
pageFind and Fix Bad Data

D migrated
pageClean and Enhance Your Data

d-toc

Excerpt

You might encounter problems with how data has been structured or formatted that you must fix prior to providing the content to your target system. You can use the methods in this section to locate problems with the content or data typing of your data. 

...

Locate mismatched values

...

In the 

D s webapp
, it is very easy to identify where there are errors in your data. What is truly innovative is how you correct them:

  1. Identify missing or mismatched data by color-coded bars in column data. 
  2. Select a bar. 
  3. Suggestions are offered in a set of cards on the right panel.
  4. Click a suggestion, and immediately see the effects of the suggested transformation previewed in the data grid. 
    1. If the transformation needs tweaking, you can edit the transformation as needed.
    2. If the transformation is not the correct one, click another suggestion.
  5. When satisfied, you add the transformation, and your sample of data is transformed. 

Image Added

D caption
Select errors in your data, and review AI-driven suggestions for how to correct. Make the change on the spot. 

Through this series of seeing, selecting, and refining issues in your sampled data, you can address basic errors in data mismatches, missing data, non-standard values, outlier values, and much more to improve the overall consistency and quality of your data. 

Find bad data

In the Transformer page, above each column of data is a data quality bar and histogram. 

The top bar is the data quality bar. The data quality bar segments the values found in the column into three color-coded bands:

Color barDescription
greenValid values for the current data type of the column
redInvalid values for the current data type of the column
blackMissing values could be empty or null.

Image Added

D caption
Mismatched values in a column are indicated in red

Change column data type

In the image above, you can identify the data type of the column based on the icon to the left of the column name (POS_Sales). In this case, the data type is Decimal. 

In some cases, invalid data can be fixed by simply changing the column data type. You can click the current data type indicator to review and select a more appropriate data type.

Tip

Tip: You can change the data type of the column by click the data type icon for the column.

Tip

Tip: No value is invalid for the String data type.


Image Added

D caption
Change column data type

Find outlier data

You can explore the details of a column of data to review statistical metrics on the data and to locate outlier values. In the column menu, select Column Details.

Image Added

D caption
Column Details
Tip

Tip: When these bars are clicked or SHIFT-clicked, the selected values are used to prompt suggestions for how to transform them.

Tip

Tip: You can explore the patterns in the data in the Patterns tab, where you can also use these patterns to standardize the formatting of your data.

Fix Mismatched Values

When

D s product
rtrue
 evaluates a dataset sample, it interprets the values in a column against its expectations for the values. Based on the column's specified data type and internal pattern matching, values are categorized as valid, mismatched, or missing. These value categories are represented in a slender bar at the top of each column. 

...

Tip

Tip: Before you start performing transformations on your data based on mismatched values, you should verify the data type for these columns to ensure that they are correct. The type against which values are checked is displayed to the upper left of the data quality bar. Below, the data type is ZIP for U.S. Zip code data. For more information, see Supported Data Types.

D caption
typefigure
Mismatched values in red

...

Info

NOTE: Remember that you are working on a sample of your data. If the sample indicator at the top left corner of the Transformer page does not indicate Full Data, then some values in your full dataset may not be represented in the sample displayed in the gridFor small datasets, the Initial Data sample includes all rows of the dataset and is unsampled.

  • From the Transformer page, click the mismatched values in a column's data quality bar to see their count, highlight them in the rows of the data grid, and trigger a set of suggestions for your review.
  • To refine the data grid view, click the Show Only Affected Rows checkbox in the status bar at the bottom of the screen. Only the rows that are affected by the previewed transform are displayed.

    Tip

    Tip: This step highlights specific values that are mismatched. You can take note of individual values.

  • To locate a specific value, click the Filters icon on the right side of the screen. In the Rows tab, enter the specific value to locate. Rows containing this value are highlighted. Back in the data grid, you can select one of these highlighted values to be prompted for suggestions.   

...

Tip

Tip: You can also use the IFMISMATCHED function to test for mismatched values. Unlike the above construction, however, IFMISMATCHED does not support an else clause when the value does match the listed data type. For more information, see IFMISMATCHED Function.

Bad data typing

Tip

Tip: Particularly for dates, data is often easiest to manage as String data type.

D s product
has a number of functions that you can deploy to manage strings. After the data has been properly formatted, you can change it to the proper data type. If you change data type immediately, you may have some challenges in reformatting and augmenting it. Do this step last.

...

Tip

Tip: If possible, you should review and refer to an available schema of your dataset, as generated from the source system. If the data has also been mis-typed in the source system, you should fix it there as well, so any future exports from that system show the correct type.

D s also
inCQLtrue
label((label = "data_type") OR (label = "mismatched") OR (label = "transformation_ui"))