When data is imported from another system, you might discover that some values are missing in it. In some cases, these values simply contain no content. In other cases, these values are non-existent. Depending on how the missing values entered the data, you may end up processing them in different ways. This section describes how to identify and manage missing data in your datasets.
NOTE: If you are unsure of the meaning of a column of data that contains missing values, you should attempt to review the source data or contact the individual who generated the data to identify why values may be missing and how to effectively manage them in and downstream systems.
When your dataset sample is evaluated, each column is validated against the column's type definition. Based on that validation, values in the column are categorized as valid, mismatched, or missing. These values are categorized in the data quality bar at the top of each column.
Values that are spaces (one or more presses of the
SPACEBAR) or tabs (one or more presses of the
TAB key) are not missing values.
Tip: To trim whitespace out of a column, use the following transformation:
This step may increase the number of missing values (for values that contain only whitespace characters) and the number of instances of matching values (for values that have spaces before and after an alphanumeric value).
\n) and newline (
\l) are considered missing.
In the data quality bar, missing values are identified in black:
Missing values in black
Tip: From the Transformer page, click the missing values in a column to see their count, highlight them in the rows of the data grid, and trigger a set of suggestions for your review.
Missing values can be sourced from a variety of issues:
Tip: When cleaning up missing data, you should look to work from bigger problems to smaller problems. If a higher percentage of a column's values have been categorized as missing data, you should look across affected rows to see if it's a wider problem. If other records look ok, you should consider deleting the column or figuring out how to manage the missing values, including populating them.
Data may also be considered missing if you don't have sufficient information about the data. For example, timestamps that do not have a timezone identifier may not be usable in the target system.
When you discover mismatched data in your dataset, you have the following basic methods of fixing it:
Identify if the column values are required.
Check the target system to determine if the field must have a value. If values are not required, don't worry about it. Consider deleting the column.
Remember that null values imported into are exported as missing values, which are easier to consume in most systems.
Check the column header and data type to determine if values are required. For example, in transactional data, a field called
coupon_code requires data only if every transaction is processed with one.
Insert a constant value. You can replace a missing value with a constant, which may make it easier to locate more important issues in the application.
Use a function. Particularly if the missing data can be computed, you can use one of the available functions to populate the missing values.
Copy values from another column. If a value from another column or a modified form of it can be used for the missing value, you can use the
set transform to overwrite the missing values.
Delete rows. Select the missing values bar and use the
delete transform to remove the problematic rows.
NOTE: Since missing data may not be an explicit problem, you should avoid deleting rows or the column itself until other options have been reviewed.
Delete the column. If the column data is unnecessary or otherwise unusable, you can delete the entire column from your dataset. Select Delete from the column drop-down.
Tip: Delete unnecessary columns as early as possible. Less data is easier to work with in the application and increases job execution performance.
NOTE: Generally speaking, inserting constants in place of missing values is not a recommended practice, especially if downstream consuming applications and individuals may not be known. In particular, you should not replace missing numeric values with a fixed numeric value, which will skew analysis. Use this method only if your entire data chain is aware of the constants.
Click the black missing values segment of the data quality bar for the column to fix.
Tip: Select a missing value in the data grid. Then, select the
You might seem something like the following:
The missing data is identified using the
row:ISMISSING reference. To apply a constant, replace the
NULL() reference with a constant value, as in the following:
Note that the single quotes around the value are required, since it identifies the value as a constant.
Tip: You can also use the
You can populate missing values with values from another column. In the following example, the
nickname column is populated with the value of
first_name if it is missing:
Particularly for numeric data, you can use functions to populate missing values. In the following example, missing values for the
unit_price column are derived from a computation of the
weight_kg column and the
Tip: Be careful using functions such as averages to compute missing values. These computations may factor outliers that have not yet been removed or may fail to account for local trends relative to the data. Study the values and their meaning in the column before performing replacements. When in doubt, a median value may be your best best, assuming outliers and spurious data have been properly addressed.
In some cases, a column may contain valid values, but the meaning of those values is missing from the data. For example, your data contains the following Timestamp information:
This timestamp information may be considered problematic for the following reasons:
There is no year information. If the target system contains multi-year datasets, it may cause issues. The month element should be interpretable by .
The following examples demonstrate how to insert this information into your timestamps.
On import, timestamp data may be classified as String data. For now, this is ok.
Tip: Particularly for dates, data is often easiest to manage as String data type. has a number of functions that you can deploy to manage strings. After the data has been properly formatted, you can change it to the proper data type. If you change data type immediately, you may have some challenges in reformatting and augmenting it. Do this step last.
After you have added back missing elements, you can change the data type to Date/Time through the data type drop-down for the column.
Before you begin reformatting your data, you should identify the target date format to which you want to match your timestamps. From the data type drop-down, select Date/Time. The dialog shows the following supported date formats:
Tip: When wrangling your data, you should start with the target structure or format of your data and work back to your source. This principle applies to both column management and overall dataset management.
Available Date/Time formats
NOTE: Each available option has a set of sub-options in the displayed drop-down.
In this timestamp example, the target format is the following:
dd-mm-yy hh:mm:ss (dd*shortMonth*yyyy*HH:MM:SS)
The easiest way to handle the insertion of year information is to split out the timestamp data into separate components and then to merge back the content together with the inserted year information. Since the above timestamp data essentially contains three separate fields (Day of Month, Month, and Time), you can use a split command to break this information into three separate columns. Highlight one of the spaces between Day of Month and Month and select the
split suggestion. The Wrangle step should look similar to the following:
Now, your data should be stored in three separate columns.
Tip: You may notice that new data types have been applied to the generated columns. The data may be easier to handle if all column types are converted to String type for now.
The next step involves merging all of these columns back into a single field, augmented with the appropriate year information. Select the columns in the order in which you would like to see them in the new timestamp field. In this case, you can select them in the order that they were originally listed. When all three columns are selected, choose the
You may notice that the data has been formatted without spaces (
19May02:45:38), and there is no year information yet. You can create new columns containing a year value (
myYear) then merge the columns together:
After you have inserted the year information and merged the columns, you should be able to change the column data type to the appropriate version of Date/Time.
Timestamps do not natively support different timezones, so this information must be stored in a separate column. For U.S. data, timezones can be determined based on the zip code.
NOTE: If missing metadata is not supported as part of the value in the target system, you can insert the metadata as a separate column and then apply the metadata to the data inside the target system.
For more information on inserting timezone metadata, see Add Lookup Data.