This documentation applies to Trifacta Wrangler. Download this free product.
Registered users of this product or Trifacta Wrangler Enterprise should login to Product Docs through the application.

After you have created or selected your dataset, a sample of its data is loaded into the Transformer page, where you begin your wrangling tasks. Through this interface, you build your transformation recipe and see the results in real-time as applied to the selected sample. When you are satisfied with what you see, you can execute a job against the entire dataset.

Goal

Tip: Before you begin transforming, you should know the target schema that your transformed data must match. A schema is the set of columns and their data types, which define the constraints of your dataset.

Your data transformation is complete when you have done the following:

  • Cleansed your data of invalid, missing, or inaccurate values
  • Modified your dataset to constrain its values to meet the target schema
  • Enhanced your dataset as needed with data from other datasets
  • Exported your dataset and transformation recipe for use in downstream systems

Sample

Loading very large datasets in the Trifacta® Application can overload your browser or otherwise impact performance, so the application is designed to work on a sample of data. After you have finished your recipe working on a sample, you execute the recipe across the entire dataset.

  • The default sample is the first set of rows of source data in the dataset, the number of which is determined by the Trifacta Application. In some cases, the entire dataset can be used as your sample. 

Methods of Building Recipes

The Trifacta Application supports the following methods for building recipes. These methods are listed in order of increasing technical requirements:

  1. Select something. When you make a selection in the Transformer page, you are prompted with a set of suggestions for steps that you can take on the selection or patterns matching the selection. 

    Tip: The easiest method for building recipes is to select items in the application. Over time, the application learns from your selections and prompts you with suggestions based on your previous use. For more information, see Overview of Predictive Transformation.

  2. Build steps. Using the Transform Builder, you can rapidly assemble recipe steps through a simple, menu-driven interface. When you modify a step, it is loaded by default in the Transform Builder, where you can make changes by selecting and entering values in pre-populated fields. 

    Tip: Use the Transform Builder for performing modifications to the transform you selected from a suggestion card.

  3. Enter recipe steps. If you are comfortable with entering programmatic commands, you should be able to quickly learn the syntax of Wrangle. Each method of building a transform step generates a command in this purpose-built data transformation language. Language commands can be entered through the Transform Editor panel in the Transformer page. 

    Tip: The Transform Editor is an effective method for learning the syntax of the language. After you have built a recipe that works by selecting cards or from the Transform Builder, you can review its exact text in the Transform Editor.

     

Cleanse

Generally, data cleansing addresses issues in data quality, which can be broadly categorized as follows:

  • Consistency. Values that describe the same thing should agree with each other. Numeric values should have the same precision. String values should be consistently structured to mean the same thing. 
  • Validity. Values should be constrained to the requirements of each field's data type. For example, a DateOfSale field should be a valid date. 
  • Reliability. Values in the same field in different records should mean the same thing. For example, the value 15 in the Temperature field of two different records should not mean Centigrade in one record and Fahrenheit in the other record. 

When data is initially imported into the Trifacta Application, it can contain multiple columns, rows, or specific values that you don't need for your final output. Specifically, this phase can involve the following basic activities:

  • Remove unused columns
  • Address missing and mismatched data
  • Change data types
  • Improve consistency, validity, and reliability of the data

First recipe steps:

When a dataset sample is first loaded into the Transformer page, the Trifacta Application attempts to split out the raw data to form regular, tabular data. If your data appears to contain a header row, the Trifacta Application can use this row for the titles of the columns.

These first steps are automatically added to the recipe in progress and applied to the sample data:

Figure: Transformer page with initial recipe steps

In the above image, three recipe steps have already been added automatically. These steps break out the sample data into rows and columns and establish a header.

  • The data resulting from these initial transforms is displayed in the data grid. See Data Grid Panel.
  • Your recipe is displayed in the Recipe panel on the right side. You might have to open this panel to see it. See Recipe Panel.
  • At the bottom of the screen, you can make menu selections to build commands in the Transform Builder. See Transform Builder.
  • For additional information on steps that can be created for you to parse your data, see Initial Parsing Steps.

Create a header row:

In most cases, the names of your columns are inferred from the first row of the data in the dataset. If not, in the Transform Editor panel, type:

header

If the header information is in a different row, you can reference that row by number:

header sourcerownumber:4

To add this or any transform in development to your recipe, click Add to Recipe. This button is disabled if the step is invalid. 

Generate row numbers:

On the left side of the data grid, you might notice a set of black dots. If you hover over one of these, the original row number from the source data is listed. Since the data transformation process can change the number or order of rows, you might want to retain the original order of the rows. Enter the following command in the Transform Editor to generate a new column called rowId, which contains the original row numbers for your dataset.

derive value:SOURCEROWNUMBER() as:'rowId'

Tip: It's a good practice to create this kind of unique identifier for rows in your dataset. For more information, see Generate Primary Keys.

Drop unused columns:

Your data might contain columns that are not of use to you, so it's in your interest to remove them to simplify the dataset. To drop a column, click the caret next to the column's title and select Drop.

Tip: If you are unsure whether to drop a column, you can use the same caret menu to hide it for now. Note that hidden columns are only hidden from display; they still appear in the output.

Tip: You can also drop ranges of columns. See Remove Data.
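If you prefer to work in the Transform Editor, a column drop can also be expressed as a Wrangle step. The following is a sketch only; the column names are hypothetical, and the exact parameter syntax may vary by product version:

```
drop col: Internal_Notes, Legacy_Id
```

This removes both named columns from the dataset in a single recipe step.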

Display only columns of interest:

You can choose which columns to display in the data grid, which can be useful for narrowing your focus to problematic areas. On the left side of the screen, click the Column Browser icon. The column browser's visual profiling tools provide immediate insight into general categories and unusual elements of your dataset, including errors and outlier values. For more information, see Column Browser Panel.

Review data quality:

After you have removed unused data, you can examine the quality of data within each column just below the column title. 

Figure: Column header

The horizontal bar, known as the data quality bar, indicates the quality of the data in the column using the following colors:

  • Green - These values are valid for the specified data type.
  • Red - These values do not match the specified data type.
  • Black - These values are missing for the column in these rows.
Tip: When you select values in the data quality bar, those values are highlighted in the sample rows, and suggestions are displayed at the bottom of the screen in the transform cards to address the selected rows.

Transform Cards:

Trifacta Wrangler uses data inference techniques to examine your data and to automatically apply initial shaping and suggestions.

Tip: Where possible, you should try to create your transforms by selecting data and then selecting the appropriate transform card. In some cases, you might need to modify the details of the recipe in the Transform Editor.

In the following example, the missing values in the SUBSCRIBER_AGE column have been selected, and a set of transform cards is displayed. 

Figure: Selecting missing values

Depending on the nature of the data, you might want to keep, delete, or modify the values. Since the data is missing, the delete card has been selected.

  • To accept this suggestion, click Add to Recipe.
  • You can modify the step if needed. An example is provided later.
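For reference, the delete suggestion in this missing-values example corresponds to a Wrangle step along the following lines. This is a sketch based on the example's SUBSCRIBER_AGE column; the exact syntax may vary by product version:

```
delete row: ISMISSING([SUBSCRIBER_AGE])
```

The row parameter holds a condition; every row for which the condition evaluates to true is removed from the dataset.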

Change data types:

If a column contains a high concentration of mismatched data (red), the column might have been identified as the wrong data type. For example, your dataset includes internal identifiers that are primarily numeric data (e.g. 10000022) but have occasional alphabetical characters in some values (e.g.  1000002A). The column for this data might be typed for integer values, when it should be treated as string values. 

Tip: Where possible, you should set the data type for each column to the appropriate type. The Trifacta Application maintains statistical information and enables some transformation steps based on data type. See Column Statistics Reference.

  • To change a column's data type, click the icon to the left of the column title. Select the new data type. Review the mismatched values for the column to verify that their count has dropped. For more information, see Supported Data Types.
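A data type change like the one described above can also be written in the Transform Editor. The following sketch assumes a hypothetical column named internal_id; the syntax may differ slightly across versions:

```
settype col: internal_id type: 'String'
```

After this step, values such as 1000002A are no longer flagged as mismatched, since any sequence of characters is a valid string.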

Explore column details:

As needed, you can explore additional details about the values in a column, including useful statistics on the range of values based on the column's data type. 

Review histograms:

Just below a column's data quality bar, you can review a histogram of the values found in the column. In the following example, the data histogram on the left applies to the ZIP column, while the one on the right applies to the WEB_CHAT_ID column.

Figure: Column data histogram

When you mouse over the categories in the histogram, you can see the corresponding value, the count of instances in the sample's column, and the percentage of affected rows. In the left histogram, the bar with the greatest number of instances has been selected; the value 21202 occurs 506 times (21.28%) in the dataset. On the right, the darker shading indicates how rows with ZIP=21202 map to values in the WEB_CHAT_ID column.

Tip: Similar to the data quality bar, you can click values in a data histogram to highlight the affected rows and to trigger a set of suggestions. In this manner, you can use the same data quality tools to apply even more fine-grained changes to individual values in a column.

For a list of common tasks to cleanse your data, see Cleanse Tasks.

Modify

After you have performed initial cleansing of your data, you might need to perform modifications to the data to properly format it for the target system, specify the appropriate level of aggregation, or other modification. When you select data in the Trifacta Application, suggested transformations are presented to you, either in the form of transform cards or suggested transforms in the Transform Editor. Select one, or create your own as needed.

Tip: Modification steps are often specific to the downstream use case for the data. If your source dataset needs to satisfy multiple downstream uses, you might need to make modifications for each use case, and these modifications may conflict with each other. It might be easier to cleanse the data first, copy the dataset, and then modify each copy to satisfy its use case.

In the following example, the improperly capitalized word BALTIMORE has been selected, so that you can change it to its propercase spelling (Baltimore). Those rows are highlighted in the row data, and a set of suggestions for how to fix it is provided in the cards at the bottom of the screen.

NOTE: If you do not see these transform cards, click the Grid Options icon and select Card Suggestions, which toggles the display of the cards.

Figure: Selecting values to modify


Depending on the nature of your data, you might want to keep or change the values, or you can remove the problematic rows altogether. 

Tip: When you select one of the transform cards, the implied changes are previewed in the Transformer page, so you can see the effects of the change. This previewing capability enables you to review and tweak your changes before they are formally applied. You can always remove a transform step if it is incorrect or even re-run the recipe to generate a corrected set of results, since source data is unchanged. For more information, see Transform Preview.

Tip: This process of selecting data in a column's data quality bar or histogram of values is the recommended method for identifying problematic data in your dataset. You can apply this method to mismatched (red) values, missing (black) values, or data outliers across all of the columns of your dataset.

In this case, select the replace transform. However, there are a couple of minor issues with the provided suggestion.

  • Since the platform has no idea about the meaning of the selection, it might initially suggest removing the text altogether. In this case, you want to change the spelling.
  • In the card, the on parameter value contains the pattern used to identify the selection. In this case, it is selecting all values that are capitalized. For now, you only want to fix BALTIMORE.

So, some aspects of this transform must be changed. Click Modify.

Transform Builder:

When you modify a transform step, you can make changes in the Transform Builder, which is a simple, menu-driven interface for building transform steps:

Figure: Modifying steps in the Transform Builder

In the Transform Builder, you can replace the pattern in the on parameter with the specific string to locate: BALTIMORE. The new value, which is currently blank, can be populated with the replacement value: Baltimore.

Click Add to Recipe. The suggestion is added to the recipe and automatically applied to the data sample displayed in the Transformer page.
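The completed step from this example corresponds to a Wrangle replace transform along these lines. This is a sketch: the column name City is assumed for illustration, and the exact parameter syntax may vary by product version:

```
replace col: City on: 'BALTIMORE' with: 'Baltimore' global: true
```

The on parameter now matches the literal string instead of the broader capitalization pattern from the original suggestion, so only BALTIMORE values are changed.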

Transform Editor:

Through the Transform Editor panel, you can modify the exact transform steps of your recipe.

NOTE: The Transform Editor is considered an advanced feature of the application. However, it is a helpful interface for learning the syntax of the underlying language.

  1. In the following example, the Recipe panel has been opened on the right side of the screen, revealing the individual steps of the recipe. 
  2. The replace transform from the previous step has been selected for editing in the Transform Builder. 
  3. Click Switch to editor.
  4. The selected step is displayed in the Transform Editor.

Figure: Transform Editor panel

In the above image, you can see how an individual transform is defined from scratch. If you need finer controls of your data transformations than those provided by the transform cards, you can use the Transform Editor to build recipe steps in Wrangle (a domain-specific language for data transformation). See Transform Editor Panel.

Trifacta Wrangler supports dozens of transforms and aggregation, logic, mathematical, statistical, and string functions. For more information, see Wrangle Language.

Aggregate data:

The raw values in your dataset might be too fine-grained for use in your target system, or you might need to standardize all values to the same level of aggregation. For example, your data might be stored at the individual product level, when you need to use it at the brand level. For more information, see Aggregate Transform.
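As a sketch of the product-to-brand example above, an aggregation step in Wrangle might look like the following. The column names sales and brand are hypothetical, and the syntax may differ by product version:

```
aggregate value: SUM(sales) group: brand
```

This replaces the row-level data with one row per brand, containing the summed sales for that brand.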

Enhance

Before you deliver your data to the target system, you might need to enhance or augment the dataset with new columns or values.

Lookup values:

In some cases, you might need to include or replace values in your dataset with other columns from another dataset. For example, transactional data can reference product and customer by internal identifiers. Through the Trifacta Application, you can create lookups into your master data set to retrieve user-friendly versions of customer and product IDs.

NOTE: The reference data that you are using for lookups must be loaded as a dataset into Trifacta Wrangler first.

To perform a lookup for a column of values, click the caret drop-down next to the column title and select Lookup....

Join datasets:

You can also join together two or more datasets based on a common set of values. For example, you are using raw sales data to build a sales commission dataset:

  • Your sales transaction dataset contains a column for the salesman's identifier, which indicates the employee who should receive the commission. 
  • You might want to join your sales transaction dataset to the employee dataset, which provides information on the employee's name and commission rate by the internal identifier. 
  • If there is no corresponding record in the employee dataset, a commission is not awarded, and the sales transaction record should not be present in the commission dataset.

This commission dataset is created by performing an inner join between the sales transaction dataset and the employee dataset. In the Transform Editor Panel, click the Tools icon and select Join.

For a list of common workflows to enhance your dataset, see Enhance Tasks.

 
