...

Method: By clustering
Description: The product can identify similar values using one of the available algorithms for comparing values. You can compare values based on spelling or language-independent pronunciation.
Recommended Uses:
  • Standardize values to correct spelling differences, capitalization, whitespace, and other errors.
  • Values must be consistent across rows of the column.
  • Primarily used for string-based data types.
How to Use: Available through the Standardize Page.

Method: By pattern
Description: The product can identify common patterns in a set of values and suggest transformations to standardize the values to a common format.
Recommended Uses:
  • Standardize values to follow a consistent format, such as phone numbers or social security numbers.
  • Data type follows a somewhat consistent format and needs reshaping.
How to Use: Available in the Patterns tab in the Column Details Panel.

Method: By function
Description: You can apply one or more specific functions to cleanse your data of minor errors in formatting or structure.
Recommended Uses:
  • Good method for improving the performance of pattern- or algorithm-based matching.
  • Some functions are specific to a data type, while others have more general application.
How to Use: Edit the column with a formula in the Transform Builder.

Method: Mix-and-match
Description: You can use combinations of the above methods for more complex use cases.
Recommended Uses:
  • Combine function-based standardization for global changes to all values with cluster- or pattern-based standardization for individual value changes.

...

Using one of the supported matching algorithms, the product can cluster together similar column values. You can review the clusters of values to determine whether they should be mapped to the same value. If so, you can apply the mapping of these values within the application.
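The review-then-map workflow described above can be sketched in a few lines of Python. This is only an illustration, not the product's implementation: the function names `cluster_values` and `apply_mapping`, the sample column, and the trivial key function (trim and lowercase) are all assumptions made for the example.

```python
from collections import defaultdict


def cluster_values(values, key):
    """Group distinct values that share the same comparison key."""
    clusters = defaultdict(list)
    for value in values:
        members = clusters[key(value)]
        if value not in members:
            members.append(value)
    # Only groups with more than one distinct value are worth reviewing.
    return [members for members in clusters.values() if len(members) > 1]


def apply_mapping(values, cluster, replacement):
    """Replace every value that belongs to the cluster with one string."""
    members = set(cluster)
    return [replacement if value in members else value for value in values]


# Hypothetical column values, for illustration only.
column = ["Apple", "apple ", "APPLE", "banana"]

# Stand-in key function; a real matching algorithm would go here.
clusters = cluster_values(column, key=lambda v: v.strip().lower())
cleaned = apply_mapping(column, clusters[0], "apple")
```

The key function is the only part that changes between matching algorithms. The grouping and mapping steps stay the same.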

Example - Multiple methods of clustering

Source:

The following dataset includes some values that could be standardized:

...

When you standardize using a spelling-based algorithm, the following values are clustered:

...

After you select the cluster of values at the top, you can enter apple in the right context panel to replace that cluster of values with a single string.

In the above, the unclustered values are dissimilar in spelling, but in English they sound the same (homophones). When you select the pronunciation-based algorithm, these values are clustered:

...

...

pear

...

When you select the top values clustered by pronunciation, you can enter pear in the right context panel. 

Results:

The six source values have been reduced to two final values through two different methods of clustering. See below for more information on the clustering algorithms.

...

pear

...

You can apply cluster-based standardization through the Standardize Page. See Standardize Page.

Clustering Algorithms

The following algorithms for clustering values are supported.

Similar strings

For comparing similar strings, the following methods can be applied:

Fingerprint:

The fingerprint method compares values in the column by applying the following steps to the data before comparing and clustering:

NOTE: These steps are applied to an internal representation of the data. Your dataset and recipe are not changed by this comparison. Changes are only applied if you choose to modify the values and add the mapping.

 

  1. Remove accents from characters, so that only ASCII characters remain.
  2. Change all characters to lowercase.
  3. Remove leading and trailing whitespace.
  4. Split the string on punctuation, any remaining whitespace, and control characters. Remaining characters are assembled into groups called tokens.
  5. Sort the tokens and remove any duplicates.
  6. Join the tokens back together.
  7. Compare all tokenized values in the column for purposes of clustering.
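As a rough illustration of the steps listed above, the following Python sketch builds a fingerprint key and groups values that share it. It is not the product's code. It assumes that step 3 trims only leading and trailing whitespace, that tokens are rejoined with a single space, and that every character other than a letter or digit acts as a separator. The function names and sample values are hypothetical.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Build a comparison key from a value without modifying the value."""
    # Step 1: decompose accented characters, then drop non-ASCII leftovers.
    text = unicodedata.normalize("NFKD", value)
    text = text.encode("ascii", "ignore").decode("ascii")
    # Step 2: lowercase everything.
    text = text.lower()
    # Step 3 (assumption): trim leading and trailing whitespace.
    text = text.strip()
    # Step 4: split on punctuation, whitespace, and control characters.
    # After steps 1 and 2, that is any character outside a-z and 0-9.
    tokens = [token for token in re.split(r"[^a-z0-9]+", text) if token]
    # Step 5: sort the tokens and remove duplicates.
    tokens = sorted(set(tokens))
    # Step 6 (assumption): join the tokens with a single space.
    return " ".join(tokens)


def cluster_by_fingerprint(values):
    """Step 7: group values whose fingerprint keys are identical."""
    clusters = defaultdict(set)
    for value in values:
        clusters[fingerprint(value)].add(value)
    return {key: sorted(members) for key, members in clusters.items()
            if len(members) > 1}


# Hypothetical values, for illustration only.
groups = cluster_by_fingerprint(["Smith, John", "john smith", "Jane Doe"])
```

Because tokens are sorted, word order does not affect the key, which is why "Smith, John" and "john smith" fall into the same cluster.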

Fingerprint Ngram:

This method follows the same steps as those listed above, except that tokens are broken up based on a specific number (N) of characters. By default, the product uses 2-character tokens.
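A minimal sketch of an n-gram variant of the fingerprint key is shown below. It follows the commonly used n-gram keying approach and is an assumption about the details, not a description of the product's internals: separators are removed entirely before the value is cut into overlapping N-character tokens, and the sorted tokens are joined without a delimiter. The function name `ngram_fingerprint` is hypothetical.

```python
import re
import unicodedata


def ngram_fingerprint(value: str, n: int = 2) -> str:
    """Build a key from the sorted, de-duplicated n-grams of a value."""
    # Normalize accents to ASCII and lowercase, as in the fingerprint steps.
    text = unicodedata.normalize("NFKD", value)
    text = text.encode("ascii", "ignore").decode("ascii").lower()
    # Assumption: drop punctuation, whitespace, and control characters so
    # that n-grams run across the whole value.
    text = re.sub(r"[^a-z0-9]+", "", text)
    # Collect every overlapping n-character token; the set removes duplicates.
    grams = {text[i:i + n] for i in range(len(text) - n + 1)}
    # Sort the tokens and join them into the comparison key.
    return "".join(sorted(grams))


key_default = ngram_fingerprint("Paris")        # 2-character tokens
key_unigram = ngram_fingerprint("Paris", n=1)   # 1-character tokens
```

Values shorter than N characters produce an empty key in this sketch, so a real implementation would need a rule for them.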

Tip: This method can provide higher-fidelity matching, although performance may suffer on columns with a high number of unique values.

Pronunciation

Values are clustered based on a language-independent pronunciation.

This method uses the double metaphone algorithm for string comparison. For more information, see Compare Strings.


Standardize Formatting by Patterns

...

Custom Type Method | Description
Dictionary file

You can upload a dictionary file containing the list of accepted values for the custom type.

NOTE: This method is likely to be superseded by dictionaries that can be applied through the Standardize page.

For more information, see Create Custom Data Types.

Regular Expressions

A custom data type can be created based on a user-defined regular expression.

NOTE: Regular expressions are powerful tools for creating matching patterns. They are considered developer tools.

For more information, see Create Custom Data Types Using RegEx.
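
To make the idea of a regex-based custom type more concrete, the sketch below checks values against a user-defined pattern using Python's standard `re` module. It is illustrative only: the phone-number pattern, the function names, and the sample values are hypothetical, and the product's own regular expression syntax and type-validation behavior may differ.

```python
import re

# Hypothetical pattern for values such as 415-555-0100.
PHONE_PATTERN = re.compile(r"\d{3}-\d{3}-\d{4}")


def matches_custom_type(value: str, pattern=PHONE_PATTERN) -> bool:
    """Return True only when the entire value matches the pattern."""
    return pattern.fullmatch(value) is not None


def validity_ratio(values, pattern=PHONE_PATTERN) -> float:
    """Return the fraction of values that are valid for the custom type."""
    if not values:
        return 0.0
    valid = sum(matches_custom_type(value, pattern) for value in values)
    return valid / len(values)


# Hypothetical column values, for illustration only.
ratio = validity_ratio(["415-555-0100", "4155550100", "n/a", "212-555-0199"])
```

Matching the whole value rather than a substring keeps values with extra leading or trailing characters from being counted as valid.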