Page tree

 

Support | BlogContact Us | 844.332.2821

 

Contents:

The cloud-based version of Trifacta Wrangler is now available! Read all about it, and register for your free account.

Contents:


Trifacta® Wrangler provides multiple mechanisms for reviewing your column values and identifying patterns in the data format or similar values which mean the same thing. Through simple visual tools, you can select the patterns or clustered value to standardize and, when prompted, the patterns or values to use as their standard. As needed, you can apply formatting or structuring functions to the data for finer grain controls. This section summarizes the available methods of standardization, as well as their recommended uses.

Standardization Methods

You can use any of the following methods for standardizing values in your dataset's columns. Depending on the situation, you may choose to mix-and-match these methods. Details on these methods are below.

MethodDescriptionRecommended UsesHow to Use
By clustering

Trifacta Wrangler can identify similar values using one of the available algorithms for comparing values. You can compare values based on spelling or language-independent pronunciation.

  • Standardize values to correct spelling differences, capitalization, whitespace, and other errors.
  • Values must be consistent across rows of the column.
  • Primarily used for string-based data types.
Available through the Standardize Page
By pattern

Trifacta Wrangler can identify common patterns in a set of values and suggest transformations to standardize the values to a common format.

  • Standardize values to follow a consistent format, such as phone numbers or social security numbers.
  • Data type follows a somewhat consistent format and needs reshaping.
Available in the Patterns tab in Column Details Panel
By functionYou can apply one or more specific functions to cleanse your data of minor errors in formatting or structure.
  • Good method for improving the performance of pattern- or algorithm-based matching.
  • Some functions are specific to a data type, while others have more general application.
Edit column with formula in the Transform Builder.
Mix-and-matchYou can use combinations of the above methods for more complex use cases.
  • Combine function-based standardization for global changes to all values with cluster- or pattern-based standardization for individual value changes.
 

Invalid Values

These standardization techniques assume that your column contains only valid or empty values. 

Tip: Standardization may help to cut down the number of invalid values. Before you begin standardizing, however, you should select the red bar in the column histogram to review the values that are invalid for the current type and to fix them via suggestion if possible. For more information, see Find Bad Data.

Standardize Values by Clustering

Using one of the supported matching algorithms, Trifacta Wrangler can cluster together similar column values. You can review the clusters of values to determine if they should be mapped to the same value. If so, you can apply the mapping of these values within the application.

Example - Multiple methods of clustering

Source:

The following dataset includes some values that could be standardized:

RowIdValues
Row01Apple
Row02pear
Row03apple
Row04pair
Row05Åpple
Row06pare

When you standardize using a spelling-based algorithm, the following values are clustered:

Source ValueNew Value
  
Apple 
apple 
Åpple 
 Unclustered values
pear 
pair 
pare 

After you select the cluster of values at top, you can enter apple, in the right context panel to replace that cluster of values with a single string.

In the above, the unclustered values are dissimilar in spelling, but in English, they sound the same (homonyms). When you select the Pronunciation-based algorithm, these values are clustered:

Source ValueNew Value
  

pear

 
pair 
pare 
 Unclustered values
Appleapple 
appleapple 
Åppleapple 

When you select the top values clustered by pronunciation, you can enter pear in the right context panel. 

Results:

The six source values have been reduced to two final values through two different methods of clustering. See below for more information on the clustering algorithms.

Source ValueNew Value
  

pear

 pear
pairpear 
parepear 
  
Appleapple 
appleapple 
Åppleapple 

You can apply cluster-based standardization through the Standardize Page. See Standardize Page.

Clustering Algorithms

The following algorithms for clustering values are supported.

Similar strings

For comparing similar strings, the following methods can be applied:

Fingerprint:

The fingerprint method compares values in the column by applying the following steps to the data before comparing and clustering:

NOTE: These steps are applied to an internal representation of the data. Your dataset and recipe are not changed by this comparison. Changes are only applied if you choose to modify the values and add the mapping.

 

  1. Remove accents from characters, so that only ASCII characters remain.
  2. Change all characters to lowercase.
  3. Remove whitespace.
  4. Split the string on punctuation, any remaining whitespace, and control characters. Remaining characters are assembled into groups called tokens.
  5. Sort the tokens and remove any duplicates.
  6. Join the tokens back together.
  7. Compare all tokenized values in the column for purposes of clustering.

Fingerprint Ngram:

This method follows the same steps as those listed above, except that tokens are broken up based on a specific (N) number of characters. By default, Trifacta Wrangler uses 2-character tokens. 

Tip: This method can provide higher fidelity matching, although there may be performance impacts on columns with a high number of unique values.

Pronunciation

Values are clustered based on a language-independent pronunciation.

This method uses the double metaphone algorithm for string comparison. For more information, see Compare Strings.

Standardize Formatting by Patterns

For individual columns, Trifacta Wrangler can analyze column values for patterns and then provide suggestions for how to normalize the patterned values into a consistent format. For example, the same US phone number can be represented in any of the following methods:

555-1212
415-555-1212
4155551212
(415) 555-1212
+1 (415) 555-1212

Tip: Pattern-based standardization is useful for confirming values in a column to a specific format. This method is applicable to data types like phone numbers, dates, social security numbers, and to a lesser extend email addresses and URLs.

You can apply pattern-based standardization through the Patterns tab. See Column Details Panel.

Standardize Using Functions

The following functions can be useful for standardizing values. 

Functions for strings

All values can be converted to string, so these string functions can be applied to any column if its data type is converted to String data type. 

Tip: The clustering algorithms may apply some of these functions to values in your column for purposes of comparison.

CategoryFunctionDescription
String ConversionCHAR Function

Generates the Unicode character corresponding to an inputted Integer value. 

 UNICODE Function

Generates the Unicode index value for the first character of the input string. 

Case ConversionUPPER Function

All alphabetical characters in the input value are converted to uppercase in the output value. 

 LOWER Function

All alphabetical characters in the input value are converted to lowercase in the output value. 

 PROPER Function

Converts an input string to propercase. Input can be a column reference or a string literal. 

Cleanse FunctionsTRIM Function

Removes leading and trailing whitespace from a string. Spacing between words is not removed. 

 REMOVEWHITESPACE Function

Removes all whitespace from a string, including leading and trailing whitespace and all whitespace within the string. 

 REMOVESYMBOLS Function

Removes all characters from a string that are not letters, numbers, accented Latin characters, or whitespace. 

String Sizing FunctionsLEFT Function 

Matches the leftmost set of characters in a string, as specified by parameter. The string can be specified as a column reference or a string literal. 

 RIGHT Function

Matches the right set of characters in a string, as specified by parameter. The string can be specified as a column reference or a string literal. 

 PAD Function 

Pads string values to be a specified minimum length by adding a designated character to the left or right end of the string. Returned value is of String type. 

 String Comparison Functions See Compare Strings

Example:

Trifacta Wrangler supports nesting functions within each other. The following transformation performs some basic cleanup on all columns in your dataset that are of String cleanup. 

Transformation Name Edit column with formula
Parameter: Columns *
Parameter: Formula IFVALID($col,'String',LEFT(UPPER(TRIM($col)),32))

  • The Columns value is a wildcard, which in this case applies the transformation across all columns in the dataset (*).
  • In the Formula, you see a nested expression. If the value in the column is valid against String data type, then, do the following to the column value:

    NOTE: The IFVALID function tests each row value for validation against the specified data type. It does not test the column against the data type. See IFVALID Function.

     

    • The TRIM function removes leading and trailing whitespace, which may register as a difference between values. See TRIM Function.
    • The UPPER function then converts the output of the TRIM function to all uppercase. So, differences in capitalization are eliminated. See UPPER Function.
    • The LEFT function truncates the output of the UPPER function to a maximum of 32 characters. See LEFT Function.

The net result of this single step applied to all columns is to eliminate whitespace, convert to uppercase, and then truncate the length of each string to only 32 characters.

For more information, see Cleanse Tasks.

Functions for numbers

You can use the following functions to standardize numeric values. 

FunctionDescription
ABS Function

Computes the absolute value of a given numeric value. The value can be a Decimal or Integer literal or a reference to a column containing numeric values.

ROUND Function

Rounds input value to the nearest integer. Input can be an Integer, a Decimal, a column reference, or an expression. Optional second argument can be used to specify the number of digits to which to round.

TRUNC Function

Removes all digits to the right of the decimal point for any value. Optionally, you can specify the number of digits to which to round. Input can be an Integer, a Decimal, a column reference, or an expression. 

NUMFORMAT Function

Formats a numeric set of values according to the specified number formatting. Source values can be a reference to a column containing Integer or Decimal values.

Example:

For the NUMFORMAT function, you can specify the full format to which you want the numeric values in the column to confirm. In the following example, all values that contain a decimal point and match with the Decimal (Float) type are forced to add a value before the decimal. This step converts values like .00 to 0.00, which standardizes the format of your numbers.

Transformation Name Edit column with formula
Parameter: Columns *
Parameter: Formula IF(FIND($col, '.')>0, IFVALID($col, 'Float',NUMFORMAT(ROUND($col,2), '0.00')),$col)

  • The Columns value is a wildcard, which in this case applies the transformation across all columns in the dataset (*).
  • In the Formula, you see a nested expression, which is a bit more complicated than the preceding String example. 
    • The outer IF function tests if the FIND function returns a non-zero value when searching each column value for the period (.). Values that match could possibly be decimals and require further evaluation:
    • If the value in the column is valid against the Decimal (Float) data type then do the following:

      • ROUND the value to two decimal points. For more information, see ROUND Function.

      • Format the value in the following manner:

        0.00
      • The above format includes the two decimal points to which you rounded, adding any extra zeros if they are not present in the input rounded value.
      • Additionally, another zero is inserted in front of the decimal if it is missing in the output of the ROUND function.
      • For more information on number formats, see NUMFORMAT Function.

For more information, see Normalize Numeric Values.

Functions for dates

Since dates are structured patterns of string-based data, the best approach is to begin by using the Patterns tab in the Column Details panel. See below. 

For more detailed modifications, you can specify formatting strings that are applied as part of the DATEFORMAT function to the dates in your column.

FunctionDescription
DATEFORMAT Function

Formats a specified Datetime set of values according to the specified date format. Source values can be a reference to a column containing Datetime values.


For more information including examples on the DATEFORMAT function, see Format Dates

 

Your Rating: Results: PatheticBadOKGoodOutstanding! 1 rates

This page has no comments.