provides multiple mechanisms for reviewing your column values and identifying patterns in the data format or similar values which mean the same thing. Through simple visual tools, you can select the patterns or clustered value to standardize and, when prompted, the patterns or values to use as their standard. As needed, you can apply formatting or structuring functions to the data for finer grain controls. This section summarizes the available methods of standardization, as well as their recommended uses.
You can use any of the following methods for standardizing values in your dataset's columns. Depending on the situation, you may choose to mix-and-match these methods. Details on these methods are below.
|Method||Description||Recommended Uses||How to Use|
can identify similar values using one of the available algorithms for comparing values. You can compare values based on spelling or language-independent pronunciation.
|Available through the Standardize Page|
can identify common patterns in a set of values and suggest transformations to standardize the values to a common format.
|Available in the Patterns tab in Column Details Panel|
|By function||You can apply one or more specific functions to cleanse your data of minor errors in formatting or structure.||Edit column with formula in the Transform Builder.|
|Mix-and-match||You can use combinations of the above methods for more complex use cases.|
These standardization techniques assume that your column contains only valid or empty values.
Tip: Standardization may help to cut down the number of invalid values. Before you begin standardizing, however, you should select the red bar in the column histogram to review the values that are invalid for the current type and to fix them via suggestion if possible. For more information, see Find Bad Data.
Using one of the supported matching algorithms, can cluster together similar column values. You can review the clusters of values to determine if they should be mapped to the same value. If so, you can apply the mapping of these values within the application.
The following dataset includes some values that could be standardized:
When you standardize using a spelling-based algorithm, the following values are clustered:
|Source Value||New Value|
After you select the cluster of values at top, you can enter
apple, in the right context panel to replace that cluster of values with a single string.
In the above, the unclustered values are dissimilar in spelling, but in English, they sound the same (homonyms). When you select the Pronunciation-based algorithm, these values are clustered:
|Source Value||New Value|
When you select the top values clustered by pronunciation, you can enter
pear in the right context panel.
The six source values have been reduced to two final values through two different methods of clustering. See below for more information on the clustering algorithms.
|Source Value||New Value|
You can apply cluster-based standardization through the Standardize Page. See Standardize Page.
The following algorithms for clustering values are supported.
For comparing similar strings, the following methods can be applied:
The fingerprint method compares values in the column by applying the following steps to the data before comparing and clustering:
NOTE: These steps are applied to an internal representation of the data. Your dataset and recipe are not changed by this comparison. Changes are only applied if you choose to modify the values and add the mapping.
This method follows the same steps as those listed above, except that tokens are broken up based on a specific (N) number of characters. By default, uses 2-character tokens.
Tip: This method can provide higher fidelity matching, although there may be performance impacts on columns with a high number of unique values.
Values are clustered based on a language-independent pronunciation.
This method uses the double metaphone algorithm for string comparison. For more information, see Compare Strings.
For individual columns, can analyze column values for patterns and then provide suggestions for how to normalize the patterned values into a consistent format. For example, the same US phone number can be represented in any of the following methods:
555-1212 415-555-1212 4155551212 (415) 555-1212 +1 (415) 555-1212
Tip: Pattern-based standardization is useful for confirming values in a column to a specific format. This method is applicable to data types like phone numbers, dates, social security numbers, and to a lesser extend email addresses and URLs.
You can apply pattern-based standardization through the Patterns tab. See Column Details Panel.
The following functions can be useful for standardizing values.
All values can be converted to string, so these string functions can be applied to any column if its data type is converted to String data type.
Tip: The clustering algorithms may apply some of these functions to values in your column for purposes of comparison.
|String Conversion||CHAR Function|
|Case Conversion||UPPER Function|
|Cleanse Functions||TRIM Function|
|String Sizing Functions||LEFT Function|
|String Comparison Functions||See Compare Strings.|
supports nesting functions within each other. The following transformation performs some basic cleanup on all columns in your dataset that are of String cleanup.
In the Formula, you see a nested expression. If the value in the column is valid against String data type, then, do the following to the column value:
NOTE: The IFVALID function tests each row value for validation against the specified data type. It does not test the column against the data type. See IFVALID Function.
The net result of this single step applied to all columns is to eliminate whitespace, convert to uppercase, and then truncate the length of each string to only 32 characters.
For more information, see Cleanse Tasks.
You can use the following functions to standardize numeric values.
For the NUMFORMAT function, you can specify the full format to which you want the numeric values in the column to confirm. In the following example, all values that contain a decimal point and match with the Decimal (Float) type are forced to add a value before the decimal. This step converts values like
0.00, which standardizes the format of your numbers.
.). Values that match could possibly be decimals and require further evaluation:
If the value in the column is valid against the Decimal (Float) data type then do the following:
ROUND the value to two decimal points. For more information, see ROUND Function.
Format the value in the following manner:
For more information, see Normalize Numeric Values.
Since dates are structured patterns of string-based data, the best approach is to begin by using the Patterns tab in the Column Details panel. See below.
For more detailed modifications, you can specify formatting strings that are applied as part of the DATEFORMAT function to the dates in your column.
For more information including examples on the DATEFORMAT function, see Format Dates.
You can create custom data types to use as a form of standardization. Values in a column that do not conform to the custom type are flagged as invalid and can be triaged accordingly.
NOTE: A custom data type does not inherently provide a means of standardizing the values. The values flagged as invalid must be converted to valid values or removed.
Custom data types can be created in either of the following ways:
|Custom Type Method||Description|
You can upload a dictionary file containing the list of accepted values for the custom type.
For more information, see Create Custom Data Types.
A custom data type can be created based on a user-defined regular expression.
For more information, see Create Custom Data Types Using RegEx.