...
Method | Description | Recommended Uses | How to Use |
---|
By clustering | can identify similar values using one of the available algorithms for comparing values. You can compare values based on spelling or language-independent pronunciation. | - Standardize values to correct spelling differences, capitalization, whitespace, and other errors.
- Values must be consistent across rows of the column.
- Primarily used for string-based data types.
| Available through the Standardize Page |
By pattern | can identify common patterns in a set of values and suggest transformations to standardize the values to a common format. | - Standardize values to follow a consistent format, such as phone numbers or social security numbers.
- Data type follows a somewhat consistent format and needs reshaping.
| Available in the Patterns tab in Column Details Panel |
By function | You can apply one or more specific functions to cleanse your data of minor errors in formatting or structure. | - Good method for improving the performance of pattern- or algorithm-based matching.
- Some functions are specific to a data type, while others have more general application.
| Edit column with formula in the Transform Builder. |
Mix-and-match | You can use combinations of the above methods for more complex use cases. | - Combine function-based standardization for global changes to all values with cluster- or pattern-based standardization for individual value changes.
| |
|
Invalid Values
These standardization techniques assume that your column contains only valid or empty values.
...
Example:
...
- The Columns value is a wildcard, which in this case applies the transformation across all columns in the dataset (
*
). In the Formula, you see a nested expression. If the value in the column is valid against String data type, then, do the following to the column value:
Info |
---|
NOTE: The IFVALID function tests each row value for validation against the specified data type. It does not test the column against the data type. See IFVALID Function. |
- The TRIM function removes leading and trailing whitespace, which may register as a difference between values. See TRIM Function.
- The UPPER function then converts the output of the TRIM function to all uppercase. So, differences in capitalization are eliminated. See UPPER Function.
- The LEFT function truncates the output of the UPPER function to a maximum of 32 characters. See LEFT Function.
...