After you have created or selected your dataset, the Transformer page opens, where you begin your wrangling tasks on a sample of the dataset. Through this interface, you build your transformation recipe and see the results in real time as applied to the sample. When you are satisfied with what you see, you can execute a job against the entire dataset.
Your data transformation is complete when you have done the following:
Tip: Before you begin transforming, you should know the target schema that your transformed data must match. A schema is the set of columns and their data types, which define the constraints of your dataset.
Tip: If you want to match up against the target schema, you can import a dataset to serve as the target schema to which you are mapping. For more information on this advanced feature, see Overview of Target Matching.
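To make the idea of a target schema concrete, the following is a minimal plain-Python sketch, not the application's own target-matching feature. The `target_schema` mapping, the `schema_problems` helper, and the sample rows are all hypothetical names and data introduced for illustration.

```python
# Hypothetical target schema: column name -> expected Python type.
target_schema = {"customer_id": int, "city": str, "amount": float}

def schema_problems(row, schema):
    """List the ways a single row deviates from the target schema."""
    problems = []
    for column, expected in schema.items():
        if column not in row:
            problems.append(f"missing column: {column}")
        elif not isinstance(row[column], expected):
            problems.append(f"wrong type: {column}")
    for column in row:
        if column not in schema:
            problems.append(f"unexpected column: {column}")
    return problems

good_row = {"customer_id": 7, "city": "Baltimore", "amount": 19.99}
bad_row = {"customer_id": "7", "city": "Baltimore", "notes": ""}

print(schema_problems(good_row, target_schema))
print(schema_problems(bad_row, target_schema))
```

Knowing the schema up front turns "is my data ready?" into a checklist of columns and types rather than a judgment call.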
The application supports the following methods for building recipes. These methods are listed in order of ease of use:
Select something. When you make a selection in the Transformer page, you are prompted with a set of suggestions for steps that you can take on the selection or patterns matching the selection. You can select columns or one or more values within columns.
Tip: The easiest method for building recipes is to select items in the application. Over time, the application learns from your selections and prompts you with suggestions based on your previous use. For more information, see Overview of Predictive Transformation.
Toolbar and column menus: In the Transformer page, you can access pre-configured transformations through the Transformer toolbar or through the context menus for individual columns.
Tip: Use the toolbar for global transformations across your dataset and the column menu for transformations on an individual column.
Search and browse for transformations. Using the Search panel and the Transform Builder, you can rapidly assemble recipe steps through a simple, menu-driven interface. When you choose to add a step, you search for your preferred transformation in the Search panel. When one is selected, the transformation is pre-populated in the Transform Builder with parameter values based on your selections in the data. See Search Panel.
Tip: Use the Transform Builder for performing modifications to the transform you selected from the Search panel or a suggestion card. See Transform Builder.
Loading very large datasets into the application can overload your browser or otherwise impact performance, so the application is designed to work on a sample of data. After you have finished building your recipe on a sample, you execute the recipe across the entire dataset.
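The sample-first workflow can be sketched in plain Python. This is an illustration of the idea only, assuming a recipe is an ordered list of row-level functions; `build_recipe`, `apply_recipe`, and the generated data are hypothetical and do not reflect how the application samples or runs jobs internally.

```python
import random

def build_recipe():
    """Return an ordered list of row-level steps (hypothetical representation)."""
    return [
        lambda row: {**row, "city": row["city"].strip()},
        lambda row: {**row, "city": row["city"].title()},
    ]

def apply_recipe(rows, recipe):
    """Apply every step, in order, to every row."""
    out = []
    for row in rows:
        for step in recipe:
            row = step(row)
        out.append(row)
    return out

# Made-up dataset standing in for a large source.
full_dataset = [{"id": i, "city": "  BALTIMORE "} for i in range(10_000)]

random.seed(0)
sample = random.sample(full_dataset, 100)

recipe = build_recipe()
preview = apply_recipe(sample, recipe)       # quick feedback on the sample
result = apply_recipe(full_dataset, recipe)  # the same recipe over all rows
```

Because the recipe is defined independently of the rows it runs on, the steps validated against the sample are exactly the steps executed on the full dataset.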
In some cases, the default sample might be inadequate or of the wrong type. To generate a new sample, click the Sample link in the upper-left corner.
NOTE: Collecting new random samples requires system resources and storage. You can collect a new random sample if you have included a step to change the number of rows in your dataset or have otherwise permanently modified data (keep, delete, lookup, join, or pivot operations). If you subsequently remove the step that made the modification, the generated sample is no longer valid and is removed. This process limits unnecessary growth in data samples.
On the right side of the screen, you can launch a new sampling job on your dataset. For more information, see Samples Panel.
Data cleansing tasks address issues in data quality, which can be broadly categorized as follows:
A value of 15 in the Temperature field of two different records should not mean Centigrade in one record and Fahrenheit in the other record.
When data is initially imported, it can contain multiple columns, rows, or specific values that you don't need for your final output. Specifically, this phase can involve the following basic activities:
First recipe steps:
When a dataset sample is first loaded into the Transformer page, the application attempts to split out the raw data to form regular, tabular data. If your data appears to contain a header row, it can be used for the titles of the columns.
In the above image, some initial parsing steps have been applied to structure the data in tabular form, but these steps are not added as formal parts of the recipe.
The data resulting from these initial transforms is displayed in the data grid. See Data Grid Panel.
Create a header row:
In most cases, the names of your columns are inferred from the first row of the data in the dataset. If you need to specify a different row, complete the following:
Click the Search icon in the menu bar.
In the Search panel textbox, type:
Generate row numbers:
On the left side of the data grid, you might notice a set of black dots. If you hover over one of these, the original row number from the source data is listed. Since the data transformation process can change the number of rows or their order, you might want to retain the original order of the rows. To retain the original row numbers in a column called rowId, complete the following:
Enter rowId in the New column name textbox.
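The value of capturing row numbers early can be seen in a small plain-Python sketch. The 1-based numbering and the sample rows are assumptions made for illustration; this is not the application's transformation syntax.

```python
rows = [{"name": "c"}, {"name": "a"}, {"name": "b"}]

# Capture the original position in a rowId column before anything reorders the data.
with_ids = [{"rowId": i, **row} for i, row in enumerate(rows, start=1)]

# A later step changes the row order...
sorted_rows = sorted(with_ids, key=lambda r: r["name"])

# ...but the original order can still be recovered from rowId.
restored = sorted(sorted_rows, key=lambda r: r["rowId"])
```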
Tip: It's a good practice to create this kind of unique identifier for rows in your dataset. However, some operations such as
Drop unused columns:
Your data might contain columns that are not of use to you, so it's in your interest to remove them to simplify the dataset. To drop a column, click the caret next to the column's title and select Delete.
Tip: If you are unsure of whether to delete the column, you can use the same caret menu to hide the column for now. Hidden columns do appear in the output.
Tip: You can also drop ranges of columns. See Remove Data.
Check column data types:
When a dataset is imported, the application attempts to identify the data type of each column from the first set of rows in the column. At times, however, type inference can be incorrect.
Tip: Before you start performing transformations on your data based on mismatched values, you should check the data type for these columns to ensure that they are correct. For more information, see Supported Data Types.
Display only columns of interest:
You can choose which columns you want to display in the data grid, which can be useful to narrow your focus to problematic areas.
In the Transformer toolbar at the top of the screen, click the Column View icon.
Tip: You can also toggle display of individual columns in the Transformer page. Click the Eye icon to review the visible columns.
These visual profiling tools provide immediate insight into general categories and unusual elements of your dataset, including errors and outlier values. For more information, see Column Browser Panel.
Review data quality:
After you have removed unused data, you can examine the quality of data within each column just below the column title.
The horizontal bar, known as the data quality bar, identifies the quality of the data in the column by the following colors:
|Color|Description|
|green|These values are valid for the specified data type.|
|red|These values do not match those of the specified type.|
|black|There are no values for the column in these rows.|
Tip: When you select values in the data quality bar, those values are highlighted in the sample rows, and suggestions are displayed at the bottom of the screen in the suggestion cards to address the selected rows.
For more information, see Data Quality Bars.
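The three categories in the data quality bar can be approximated with a short plain-Python sketch. The `classify` and `quality_counts` helpers and the sample values are hypothetical, and treating a value as valid when Python can parse it is an assumption; the application's own type validation rules are not shown here.

```python
def classify(value, parser):
    """Return 'missing', 'mismatched', or 'valid' for one raw string value."""
    if value is None or value.strip() == "":
        return "missing"        # shown in black
    try:
        parser(value)
    except ValueError:
        return "mismatched"     # shown in red
    return "valid"              # shown in green

def quality_counts(values, parser):
    """Tally the three categories for a whole column."""
    counts = {"valid": 0, "mismatched": 0, "missing": 0}
    for value in values:
        counts[classify(value, parser)] += 1
    return counts

# Made-up column values, checked against an integer type.
ages = ["34", "", "51", "n/a", None, "27"]
counts = quality_counts(ages, int)
print(counts)
```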
The application uses data inference techniques to examine your data based on your selections to prompt you with suggested transformations.
Tip: Where possible, you should try to create your transforms by selecting data and then selecting the appropriate suggestion card. In some cases, you might need to modify the details of the recipe.
In the following example, the missing values in the SUBSCRIBER_AGE column have been selected, and a set of suggestion cards is displayed.
Selecting missing values
Tip: When previewing a recipe step, you can use the checkboxes in the status bar to display only affected rows, columns, or both, which helps you to assess the effects of your step.
Depending on the nature of the data, you might want to keep, delete, or modify the values. Since the data is missing, the Delete card has been selected.
Change data types:
If a column contains a high concentration of mismatched data (red), the column might have been identified as the wrong data type. For example, your dataset includes internal identifiers that are primarily numeric data (e.g. 10000022) but have occasional alphabetical characters in some values (e.g. 1000002A). The column for this data might be typed for integer values, when it should be treated as string values.
Tip: Where possible, you should set the data type for each column to the appropriate type. The application maintains statistical information and enables some transformation steps based upon data type. See Column Statistics Reference.
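The identifier example can be checked with a few lines of plain Python. This sketch only illustrates why retyping the column removes the mismatches; `mismatch_count` is a hypothetical helper, and the two extra identifier values are made up to fill out the column.

```python
# 10000022 and 1000002A come from the example above; the other values are made up.
ids = ["10000022", "10000023", "1000002A", "10000025"]

def mismatch_count(values, parser):
    """Count values that the given type parser rejects."""
    bad = 0
    for value in values:
        try:
            parser(value)
        except ValueError:
            bad += 1
    return bad

as_integer = mismatch_count(ids, int)  # the value with a letter fails to parse
as_string = mismatch_count(ids, str)   # every value is a valid string
```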
Explore column details:
Just below a column's data quality bar, you can review a histogram of the values found in the column. In the following example, the data histogram on the left applies to the ZIP column, while the one on the right applies the
Column data histogram
When you mouse over the categories in the histogram, you can see the corresponding value, the count of instances in the sample's column, and the percentage of affected rows. In the histogram on the left, the bar with the greatest number of instances has been selected; the value 21202 occurs 506 times (21.28%) in the dataset. On the right, the darker shading indicates how rows with ZIP=21202 map to values in the
Tip: Similar to the data quality bar, you can click values in a data histogram to highlight the affected rows and to trigger a set of suggestions. In this manner, you can use the same data quality tools to apply even more fine-grained changes to individual values in a column.
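The figures shown when you mouse over a histogram bar (value, count, and percentage) can be reproduced for any column with a plain-Python sketch like the following. The ZIP values are a small made-up sample, not the dataset from the example above.

```python
from collections import Counter

# Made-up sample of ZIP values.
zips = ["21202", "21202", "21201", "21230", "21202", "21201", "21202", "21218"]

histogram = Counter(zips)
total = len(zips)

# The tallest bar: most frequent value, its count, and its share of rows.
top_value, top_count = histogram.most_common(1)[0]
top_pct = round(100 * top_count / total, 2)

for value, count in histogram.most_common():
    print(value, count, f"{100 * count / total:.2f}%")
```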
For a list of common tasks to cleanse your data, see Cleanse Tasks.
After you have performed initial cleansing of your data, you might need to perform modifications to the data to properly format it for the target system, specify the appropriate level of aggregation, or perform some other modification. When you select data, suggested transformations are presented to you as suggestion cards. Select one, or create your own transformation as needed.
Tip: Modification steps are often specific to the downstream use case for the data. If your source dataset needs to satisfy multiple downstream uses, you might need to make a different set of modifications for each use case, and these modifications can conflict with each other. It might be easier to cleanse first, create a reference for the recipe object, and then import the reference dataset in each flow for further modification. For more information, see Flow View Page.
In the following example, the improperly capitalized word BALTIMORE has been selected, so that you can change it to its proper-case spelling (Baltimore). Those rows are highlighted in the row data, and a set of suggestions for how to fix them has been provided in the cards at the bottom of the screen. See Suggestion Cards Panel.
Selecting values to modify
Depending on the nature of your data, you might want to keep or change the values, or you can remove the problematic rows altogether.
Tip: When you select one of the suggestion cards, the implied changes are previewed in the Transformer page, so you can see the effects of the change. This previewing capability enables you to review and tweak your changes before they are formally applied. You can always remove a transform step if it is incorrect or even re-run the recipe to generate a corrected set of results, since source data is unchanged. For more information, see Transform Preview.
Tip: This process of selecting data in a column's data quality bar or histogram of values is the recommended method for identifying problematic data in your dataset. You can apply this method to mismatched (red), missing (black) values, or data outliers across all of the columns of your dataset.
In this case, select the replace transform. However, there are a couple of minor issues with the provided suggestion.
The on parameter value contains the pattern used to identify the selection. In this case, it is selecting all values that are capitalized. For now, you only want to fix
So, some aspects of this transform must be changed. Click Edit.
When you modify a transform step, you can make changes in the Transform Builder, which is a simple, menu-driven interface for modifying your transformations:
Modifying steps in the Transform Builder
In the Transform Builder, you can replace the pattern with the specific string to locate: BALTIMORE. The new value, which is currently blank, can be populated with the replacement value: Baltimore. Click Add.
The step is added to the recipe and automatically applied to the data sample displayed in the Transformer page. For more information, see Transform Builder.
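The difference between the suggested pattern and the edited step can be illustrated in plain Python. This is a sketch of the behavior only, using Python's `re` module and `str.replace` rather than the application's own transform syntax; the regular expression is an assumed stand-in for "all values that are capitalized", and the city values are made up.

```python
import re

cities = ["BALTIMORE", "Baltimore", "TOWSON", "baltimore"]

# Broad pattern: rewrites every all-uppercase value, including ones you did not intend.
broad = [re.sub(r"^[A-Z]+$", "Baltimore", city) for city in cities]

# Narrow literal match: rewrites only the specific string BALTIMORE.
narrow = [city.replace("BALTIMORE", "Baltimore") for city in cities]

print(broad)
print(narrow)
```

With the broad pattern, TOWSON is also rewritten to Baltimore, which is why the step is edited to match the literal string instead.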
The raw values in your dataset might be too fine-grained for use in your target system, or you might need to standardize all values to the same level of aggregation. For example, your data might be stored at the individual product level, when you need to use it at the brand level. For more information, see Pivot Transform.
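Rolling product-level rows up to the brand level can be sketched in plain Python as a group-and-sum. The column names and figures are made up for illustration, and this is not the application's Pivot transform.

```python
from collections import defaultdict

# Made-up product-level rows.
sales = [
    {"brand": "Acme", "product": "Widget", "units": 3},
    {"brand": "Acme", "product": "Gadget", "units": 2},
    {"brand": "Zenith", "product": "Lamp", "units": 4},
]

# Group by brand and sum units, dropping the product-level detail.
units_by_brand = defaultdict(int)
for row in sales:
    units_by_brand[row["brand"]] += row["units"]

brand_rows = [
    {"brand": brand, "units": units}
    for brand, units in sorted(units_by_brand.items())
]
```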
Before you deliver your data to the target system, you might need to enhance or augment the dataset with new columns or values from other datasets. These multi-dataset operations can greatly expand the capabilities of your wrangling workflows.
You can append a dataset of identical structure to your currently loaded one to expand the data volume. For example, you can string together daily log data to build weeks of log information. See Union Page.
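Appending datasets of identical structure amounts to concatenating their rows after confirming that the columns match. The following plain-Python sketch assumes each dataset is a list of dictionaries; the `union` helper and the log rows are hypothetical.

```python
def union(datasets):
    """Concatenate rows from several datasets that share the same columns."""
    columns = None
    combined = []
    for dataset in datasets:
        for row in dataset:
            if columns is None:
                columns = set(row)          # first row defines the structure
            elif set(row) != columns:
                raise ValueError("datasets must share the same columns")
            combined.append(row)
    return combined

# Made-up daily log data.
monday = [{"ts": "2024-01-01T09:00", "event": "login"}]
tuesday = [
    {"ts": "2024-01-02T09:05", "event": "login"},
    {"ts": "2024-01-02T17:30", "event": "logout"},
]

week = union([monday, tuesday])
```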
In some cases, you might need to include or replace values in your dataset with other columns from another dataset. For example, transactional data can reference product and customer by internal identifiers. You can create lookups into your master data set to retrieve user-friendly versions of customer and product IDs.
NOTE: The reference data that you are using for lookups must be loaded as a dataset into the application first.
To perform a lookup for a column of values, click the caret drop-down next to the column title and select Lookup....
See Lookup Wizard.
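Conceptually, a lookup maps each identifier in a column to a value from reference data. The plain-Python sketch below uses a dictionary as the reference dataset; the identifiers, names, and the choice to leave unmatched rows as `None` are assumptions for illustration, not the behavior of the Lookup Wizard.

```python
# Made-up master data: product ID -> user-friendly name.
products = {"P-100": "Widget", "P-200": "Gadget"}

# Made-up transactional data referencing products by ID.
transactions = [
    {"txn": 1, "product_id": "P-100"},
    {"txn": 2, "product_id": "P-999"},  # no entry in the master data
]

# Add a product_name column; unmatched IDs get None in this sketch.
enriched = [
    {**txn, "product_name": products.get(txn["product_id"])}
    for txn in transactions
]
```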
You can also join together two or more datasets based on a common set of values. For example, you are using raw sales data to build a sales commission dataset:
This commission dataset is created by performing an inner join between the sales transaction dataset and the employee dataset. In the Search panel, enter join. See Join Page.
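An inner join keeps only the rows whose key appears in both datasets. The following plain-Python sketch builds a commission dataset this way; the column names, the `rate_pct` field, and all values are hypothetical, and the application's join step is configured through the Join Page rather than code like this.

```python
# Made-up sales transactions.
sales = [
    {"employee_id": 1, "amount": 1000},
    {"employee_id": 2, "amount": 500},
    {"employee_id": 9, "amount": 700},   # no matching employee: dropped
]

# Made-up employee data with a commission rate in whole percent.
employees = [
    {"employee_id": 1, "name": "Ana", "rate_pct": 5},
    {"employee_id": 2, "name": "Ben", "rate_pct": 10},
    {"employee_id": 3, "name": "Cy", "rate_pct": 8},   # no sales: dropped
]

employees_by_id = {e["employee_id"]: e for e in employees}

# Inner join on employee_id, then compute the commission per sale.
commissions = [
    {
        "name": employees_by_id[s["employee_id"]]["name"],
        "commission": s["amount"] * employees_by_id[s["employee_id"]]["rate_pct"] // 100,
    }
    for s in sales
    if s["employee_id"] in employees_by_id
]
```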
For a list of common workflows to enhance your dataset, see Enrichment Tasks.
As part of the transformation process, you can generate and review visual profiles of individual columns and your entire dataset. These interactive profiles can be very helpful in identifying anomalies, outliers, and other issues with your data.
These profiles appear as: