...
Excerpt |
---|
provides multiple mechanisms for reviewing your column values and identifying patterns in the data format or similar values which mean the same thing. This section summarizes the available methods of standardization, as well as their recommended uses. |
Standardization Methods
Through simple visual tools, you can select the patterns or clustered value to standardize and, when prompted, the patterns or values to use as their standard. As needed, you can apply formatting or structuring functions to the data for finer grain controls.
...
Standardization Methods
You can use any of the following methods for standardizing values in your dataset's columns. Depending on the situation, you may choose to mix-and-match these methods. Details on these methods are below.
Method | Description | Recommended Uses | How to Use |
---|
By clustering | can identify similar values using one of the available algorithms for comparing values. You can compare values based on spelling or language-independent pronunciation. | - Standardize values to correct spelling differences, capitalization, whitespace, and other errors.
- Values must be consistent across rows of the column.
- Primarily used for string-based data types.
| Available through the Standardize Page |
By pattern | can identify common patterns in a set of values and suggest transformations to standardize the values to a common format. | - Standardize values to follow a consistent format, such as phone numbers or social security numbers.
- Data type follows a somewhat consistent format and needs reshaping.
| Available in the Patterns tab in Column Details Panel |
By function | You can apply one or more specific functions to cleanse your data of minor errors in formatting or structure. | - Good method for improving the performance of pattern- or algorithm-based matching.
- Some functions are specific to a data type, while others have more general application.
| Edit column with formula in the Transform Builder. |
Mix-and-match | You can use combinations of the above methods for more complex use cases. | - Combine function-based standardization for global changes to all values with cluster- or pattern-based standardization for individual value changes.
| |
|
Invalid Values
These standardization techniques assume that your column contains only valid or empty values.
...
Example:
...
- The Columns value is a wildcard, which in this case applies the transformation across all columns in the dataset (
*
). In the Formula, you see a nested expression. If the value in the column is valid against String data type, then, do the following to the column value:
Info |
---|
NOTE: The IFVALID function tests each row value for validation against the specified data type. It does not test the column against the data type. See IFVALID Function. |
- The TRIM function removes leading and trailing whitespace, which may register as a difference between values. See TRIM Function.
- The UPPER function then converts the output of the TRIM function to all uppercase. So, differences in capitalization are eliminated. See UPPER Function.
- The LEFT function truncates the output of the UPPER function to a maximum of 32 characters. See LEFT Function.
...
For more information including examples on the DATEFORMAT function, see Format Dates.